Preface



Welcome to the proceedings of the 8th European Conference on Computer Vision!

Following a very successful ECCV 2002, the response to our call for papers 
was almost equally strong - 555 papers were submitted. We accepted 41 papers 
for oral and 149 papers for poster presentation. 

Several innovations were introduced into the review process. First, the number
of program committee members was increased to reduce their review load. We
managed to assign no more than 12 papers to each program committee member.
Second, we adopted a paper ranking system. Program committee members were
asked to rank all the papers assigned to them, even those that were reviewed
by additional reviewers. Third, we allowed authors to respond to the reviews
consolidated in a discussion involving the area chair and the reviewers. Fourth,
the reports, the reviews, and the responses were made available to the authors as
well as to the program committee members. Our aim was to provide the authors
with maximal feedback and to let the program committee members know how
authors reacted to their reviews and how their reviews were or were not reflected
in the final decision. Finally, we reduced the length of reviewed papers from 15
to 12 pages.

The preparation of ECCV 2004 went smoothly thanks to the efforts of the
organizing committee, the area chairs, the program committee, and the reviewers.
We are indebted to Anders Heyden, Mads Nielsen, and Henrik J. Nielsen for
passing on ECCV traditions and to Dominique Asselineau from ENST/TSI who
kindly provided his GestRFIA conference software. We thank Jan-Olof Eklundh
and Andrew Zisserman for encouraging us to organize ECCV 2004 in Prague.
Andrew Zisserman also contributed many useful ideas concerning the
organization of the review process. Olivier Faugeras represented the ECCV Board
and helped us with the selection of conference topics. Kyros Kutulakos provided
helpful information about the CVPR 2003 organization. David Vernon helped to
secure ECVision support.

This conference would never have happened without the support of the
Centre for Machine Perception of the Czech Technical University in Prague.
We would like to thank Radim Sara for his help with the review process and
the proceedings organization. We thank Daniel Vecerka and Martin Matousek,
who made numerous improvements to the conference software. Petr Pohl helped
to put the proceedings together. Martina Budosova helped with administrative
tasks. Hynek Bakstein, Ondrej Chum, Jana Kostkova, Branislav Micusik, Stepan
Obdrzalek, Jan Sochman, and Vit Zyka helped with the organization.
Abstract. We present a theory and algorithms for a generic calibration
concept that is based on the following recently introduced general imaging
model. An image is considered as a collection of pixels, and each pixel
measures the light travelling along a (half-) ray in 3-space associated
with that pixel. Calibration is the determination, in some common
coordinate system, of the coordinates of all pixels’ rays. This model
encompasses most projection models used in computer vision or
photogrammetry, including perspective and affine models, optical distortion
models, stereo systems, or catadioptric systems - central (single
viewpoint) as well as non-central ones. We propose a concept for calibrating
this general imaging model, based on several views of objects with known
structure, but which are acquired from unknown viewpoints. In principle, it
allows cameras of any of the types contained in the general imaging model
to be calibrated using one and the same algorithm. We first develop
the theory and an algorithm for the most general case: a non-central
camera that observes 3D calibration objects. This is then specialized to
the case of central cameras and to the use of planar calibration objects.
The validity of the concept is shown by experiments with synthetic and
real data.



1 Introduction 

We consider the camera calibration problem, i.e. the estimation of a camera’s
intrinsic parameters. A camera’s intrinsic parameters (plus the associated
projection model) usually give exactly the following information: for any point
in the image, they allow one to compute the ray in 3D along which the light
that falls onto that point travels (here, we neglect point spread).

Most existing camera models are parametric (i.e. defined by a few intrinsic 
parameters) and address imaging systems with a single effective viewpoint (all 
rays pass through one point). In addition, existing calibration procedures are
tailor-made for specific camera models.

The aim of this work is to relax these constraints: we want to propose and
develop a calibration method that should work for any type of camera model,
and especially also for cameras without a single effective viewpoint. To do so,
we first give up parametric models and adopt the following very general
model: a camera acquires images consisting of pixels; each pixel captures light
that travels along a ray in 3D. The camera is fully described by:

T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3022, pp. 1-13, 2004. 

(c) Springer-Verlag Berlin Heidelberg 2004
- the coordinates of these rays (given in some local coordinate frame).

- the mapping between rays and pixels; this is basically a simple indexing.

This general imaging model allows us to describe virtually any camera that
captures light rays travelling along straight lines (see footnote 1). Examples
(cf. figure 1):

- a camera with any type of optical distortion, such as radial or tangential.

- a camera looking at a reflective surface, e.g. as often used in surveillance, a
camera looking at a spherical or otherwise curved mirror [10]. Such systems,
as opposed to central catadioptric systems [3] composed of cameras and
parabolic mirrors, do not in general have a single effective viewpoint.

- multi-camera stereo systems: put together the pixels of all image planes;
they “catch” light rays that definitely do not travel along lines that all pass
through a single point. Nevertheless, in the above general camera model, a
stereo system (with rigidly linked cameras) is considered as a single camera.

- other acquisition systems, see e.g. [4,14,19], insect eyes, etc.



Relation to previous work. See [9,17] for reviews and references on existing
calibration methods and e.g. [6] for an example related to central catadioptric
devices. A calibration method for certain types of non-central catadioptric
cameras (e.g. due to misalignment of the mirror) is given in [2].

The above imaging model has already been used, in more or less explicit
form, in various works [8,12,13,14,15,16,19,23,24,25], and is best described in
[8], where issues other than sensor geometry, e.g. radiometry, are also discussed.
There are conceptual links to other works: acquiring an image with a camera
of our general model may be seen as sampling the plenoptic function [1], and a
light field [11] or lumigraph [7] may be interpreted as a single image, acquired
by a camera of an appropriate design.

To our knowledge, the only previously proposed calibration approaches for
the general imaging model are due to Swaminathan, Grossberg and Nayar [8,
22]. The approach in [8] requires the acquisition of two or more images of a
calibration object with known structure, and knowledge of the camera or object
motion between the acquisitions. In this work, we develop a completely general
approach that requires taking three or more images of calibration objects, from
arbitrary and unknown viewing positions. The approach in [22] does not
require calibration objects, but needs to know the camera motion. Calibration
is formulated as a non-linear optimization problem. In this work, “closed-form”
solutions are proposed (they require solving linear equation systems).

Other related works deal mostly with epipolar geometry estimation and
modeling [13,16,24] and motion estimation for already calibrated cameras [12,15].

1 However, it would not work, for example, with a camera looking from the air
into water: each pixel is still associated with a refracted ray in the water, but
when the camera moves, the refraction effect causes the set of rays to move
non-rigidly, hence the calibration would be different for each camera position.
Organization. In §2, we explain the camera model used and give some notations.
For ease of explanation and understanding, the calibration concept is first
introduced for 2D cameras, in §3. The general concept for 3D cameras is described
in §4 and variants (central vs. non-central camera and planar vs. 3D calibration
objects) are developed in §5. Some experimental results are shown in §6, followed
by discussions and conclusions.

2 Camera Model and Notations 

We give the definition of the (purely geometrical) camera model used in this
work. It is essentially the same as the model of [8], where other issues such as
point spread and radiometry are treated in addition. We assume that a camera
delivers images that consist of a set of pixels, where each pixel captures/measures
the light travelling along some half-ray. In our calibration method, we do not
model half-rays explicitly, but rather use their infinite extensions, camera rays.
Camera rays corresponding to different pixels need not intersect - in this general
case, we speak of non-central cameras, whereas if all camera rays intersect in
a single point, we have a central camera with an optical center.

Furthermore, the physical location of the actual photosensitive elements that
correspond to pixels does in general not matter at all. On the one hand, this
means that the camera ray corresponding to some pixel need not pass through
that pixel, cf. figure 1. On the other hand, neighborhood relations between pixels
need not in general be taken into account: the set of a camera’s
photosensitive elements may lie on a single surface patch (image plane), but may
also lie on a 3D curve, on several surface patches or even be placed at completely
isolated positions. In practice however, we do use some continuity assumption,
useful in the stage of 3D-2D matching, as explained in §6: we suppose that
pixels are indexed by two integer coordinates like in traditional cameras and that
camera rays of pixels with neighboring coordinates are “close” to one another.
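
As an illustration of this model, the following minimal sketch stores a camera
as nothing more than a lookup table from pixel indices to rays. The names
`Ray`, `GenericCamera` and `passes_through` are our own and purely
illustrative; the tolerance-based test for a common point is an assumption
made for the example, not part of the model definition.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[int, int]


@dataclass(frozen=True)
class Ray:
    """Infinite camera ray: a point on the line and a direction vector."""
    point: Vec3
    direction: Vec3


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a: Vec3) -> float:
    return (a[0] ** 2 + a[1] ** 2 + a[2] ** 2) ** 0.5


def distance_point_to_ray(p: Vec3, ray: Ray) -> float:
    """Orthogonal distance from point p to the line carrying the ray."""
    w = (p[0] - ray.point[0], p[1] - ray.point[1], p[2] - ray.point[2])
    return _norm(_cross(w, ray.direction)) / _norm(ray.direction)


class GenericCamera:
    """Hypothetical container: the camera is just an indexed set of rays."""

    def __init__(self, rays: Dict[Pixel, Ray]) -> None:
        self.rays = dict(rays)

    def ray(self, pixel: Pixel) -> Ray:
        return self.rays[pixel]

    def passes_through(self, center: Vec3, tol: float = 1e-9) -> bool:
        """True if every ray passes within tol of the candidate center,
        i.e. the camera is central with that optical center."""
        return all(distance_point_to_ray(center, r) <= tol
                   for r in self.rays.values())
```

With this representation, a rigid stereo pair is simply a single
`GenericCamera` whose rays do not share a common point.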

3 The Calibration Concept for 2D Cameras 

We consider here a camera and scene living in a 2D plane, i.e. camera rays are
lines in that plane. Two images are acquired, while the imaged object undergoes
some motion. Consider a single pixel and its camera ray, cf. figure 2. Figures 2
(b) and (c) show the two points on the object that are seen by that pixel in the
two images. We assume that we can determine the coordinates of these two
points, in some local coordinate frame attached to the object (“matching”).

The case of known motion. If the object’s motion between image acquisitions is 
known, then the two object points can be mapped to a single coordinate frame, 
e.g. the object’s coordinate frame at its second position, as shown in figure 2 
(d). Computing our pixel’s camera ray is then simply done by joining the two 
points. This summarizes the calibration approach proposed by Grossberg and 
Nayar [8], applied here for the 2D case. Camera rays are thus initially expressed 
Fig. 1. Examples of imaging systems. (a) Catadioptric system. Note that camera rays
do not pass through their associated pixels. (b) Central camera (e.g. perspective, with or
without radial distortion). (c) Camera looking at reflective sphere. This is a non-central
device (camera rays do not intersect in a single point). (d) Omnivergent imaging
system [14,19]. (e) Stereo system (non-central) consisting of two central cameras.



in a coordinate frame attached to the calibration object. This does not matter
(all that counts are the relative positions of the rays), but for convenience, one
would typically choose a better frame. For a central camera, for example, one
would choose the optical center as origin, or for a non-central camera, the point
that minimizes the sum of distances to the set of camera rays (if it exists).

Note that it is not required that the two images be taken of the same object; 
all that is needed is knowledge of point positions relative to coordinate frames 
of the objects, and the “motion” between the two coordinate frames. 
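
The known-motion case can be sketched in a few lines. This is a minimal
illustration under our own conventions: points are homogeneous 2D vectors, the
known motion is given as a 3 x 3 rigid transformation that maps coordinates of
the second object frame into the first one (the text above uses the second
frame as common frame; the choice is arbitrary), and the function names are
hypothetical.

```python
import math


def rigid_2d(angle, tx, ty):
    """3x3 homogeneous matrix of a planar rigid motion (rotation, translation)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def apply(T, q):
    """Apply a 3x3 homogeneous transformation to a homogeneous 2D point."""
    return tuple(sum(T[i][j] * q[j] for j in range(3)) for i in range(3))


def join(p, q):
    """Line through two homogeneous 2D points (their cross product)."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def ray_from_known_motion(q_first, q_second, second_to_first):
    """Camera ray of one pixel, as a homogeneous line in the first frame.

    q_first, q_second: object points seen by the pixel in the two images,
    each expressed in the object's local frame at acquisition time.
    """
    return join(q_first, apply(second_to_first, q_second))
```

A point X lies on the returned line l exactly when l[0]*X[0] + l[1]*X[1] +
l[2]*X[2] = 0, which gives a quick consistency check on synthetic data.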

The case of unknown motion. This approach is no longer applicable and we 
need to estimate, implicitly or explicitly, the unknown motion. We show how to 
do this, given three images. Let Q, Q' and Q" be the points on the calibration 




Fig. 2. (a) The camera as black box, with one pixel and the associated camera ray.
(b) The pixel sees a point on a calibration object, whose coordinates are identified in
a frame associated with the object. (c) Same as (b), for another position of the object.
(d) Due to known motion, the two points on the calibration object can be placed in
the same coordinate frame. The camera ray is then determined by joining them.
objects, that are seen in the same pixel. These are 3-vectors of homogeneous
coordinates, expressed in the respective local coordinate frame. Without loss
of generality, we choose the coordinate frame associated with the object’s first
position as common frame. The unknown relative motions between the second
and third frames and the first one are given by 2 x 2 rotation matrices R' and R"
and translation vectors t' and t". Note that $R'_{11} = R'_{22}$ and
$R'_{12} = -R'_{21}$ (same for R"). Mapping the calibration points to the
common frame gives points



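\[
Q, \qquad
\begin{pmatrix} R' & t' \\ 0^T & 1 \end{pmatrix} Q', \qquad
\begin{pmatrix} R'' & t'' \\ 0^T & 1 \end{pmatrix} Q'' .
\]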



They must lie on the pixel’s camera ray, i.e. must be collinear. Hence, the 
determinant of the matrix composed of their coordinate vectors, must vanish: 



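\[
\begin{vmatrix}
Q_1 & R'_{11}Q'_1 + R'_{12}Q'_2 + t'_1 Q'_3 & R''_{11}Q''_1 + R''_{12}Q''_2 + t''_1 Q''_3 \\
Q_2 & R'_{21}Q'_1 + R'_{22}Q'_2 + t'_2 Q'_3 & R''_{21}Q''_1 + R''_{22}Q''_2 + t''_2 Q''_3 \\
Q_3 & Q'_3 & Q''_3
\end{vmatrix}
= 0 \qquad (1)
\]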



Table 1. Non-zero coefficients of the trifocal calibration tensor for a general 2D camera.

  i    C_i                            V_i
  1    Q1 Q'1 Q"3 + Q2 Q'2 Q"3        R'21
  2    Q1 Q'2 Q"3 - Q2 Q'1 Q"3        R'22
  3    Q1 Q'3 Q"1 + Q2 Q'3 Q"2        -R"21
  4    Q1 Q'3 Q"2 - Q2 Q'3 Q"1        -R"22
  5    Q3 Q'1 Q"1 + Q3 Q'2 Q"2        R'11 R"21 - R'21 R"11
  6    Q3 Q'1 Q"2 - Q3 Q'2 Q"1        R'11 R"22 - R'21 R"12
  7    Q1 Q'3 Q"3                     t'2 - t"2
  8    Q2 Q'3 Q"3                     -t'1 + t"1
  9    Q3 Q'1 Q"3                     R'11 t"2 - R'21 t"1
 10    Q3 Q'2 Q"3                     R'12 t"2 - R'22 t"1
 11    Q3 Q'3 Q"1                     R"21 t'1 - R"11 t'2
 12    Q3 Q'3 Q"2                     R"22 t'1 - R"12 t'2
 13    Q3 Q'3 Q"3                     t'1 t"2 - t'2 t"1



This equation is trilinear in the calibration point coordinates. The equation’s
coefficients may be interpreted as coefficients of a trilinear matching tensor; they
depend on the unknown motions’ coefficients, and are given in table 1. In the
following, we sometimes call this the calibration tensor. It is somewhat related
to the homography tensor derived in [18]. Among the 3 x 3 x 3 = 27 coefficients
of the calibration tensor, 8 are always zero and among the remaining 19, there
are 6 pairs of identical ones (up to sign). The columns of table 1 are interpreted
as follows: the $C_i$ are trilinear products of point coordinates and the $V_i$
are the associated coefficients of the tensor. The following equation is thus
equivalent to (1):

\[
\sum_{i=1}^{13} C_i V_i = 0 . \qquad (2)
\]
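
The equivalence between the determinant (1) and the sum (2) can be checked
numerically. The sketch below is our own illustration: the thirteen products
and coefficients are written out by expanding determinant (1) and using
$R_{11} = R_{22}$, $R_{12} = -R_{21}$ for both rotations, and all function names
are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def mapped(R, t, q):
    """Map a homogeneous 2D point q into the common frame by (R, t)."""
    return (R[0][0] * q[0] + R[0][1] * q[1] + t[0] * q[2],
            R[1][0] * q[0] + R[1][1] * q[1] + t[1] * q[2],
            q[2])


def collinearity_det(Q, Qp, Qpp, Rp, tp, Rpp, tpp):
    """Left-hand side of equation (1)."""
    a, b, c = Q, mapped(Rp, tp, Qp), mapped(Rpp, tpp, Qpp)
    return det3([[a[0], b[0], c[0]], [a[1], b[1], c[1]], [a[2], b[2], c[2]]])


def tensor_V(Rp, tp, Rpp, tpp):
    """The 13 coefficients V_1..V_13 (depend on the motions only)."""
    return [Rp[1][0], Rp[1][1], -Rpp[1][0], -Rpp[1][1],
            Rp[0][0] * Rpp[1][0] - Rp[1][0] * Rpp[0][0],
            Rp[0][0] * Rpp[1][1] - Rp[1][0] * Rpp[0][1],
            tp[1] - tpp[1], -tp[0] + tpp[0],
            Rp[0][0] * tpp[1] - Rp[1][0] * tpp[0],
            Rp[0][1] * tpp[1] - Rp[1][1] * tpp[0],
            Rpp[1][0] * tp[0] - Rpp[0][0] * tp[1],
            Rpp[1][1] * tp[0] - Rpp[0][1] * tp[1],
            tp[0] * tpp[1] - tp[1] * tpp[0]]


def products_C(Q, Qp, Qpp):
    """The 13 trilinear products C_1..C_13 (depend on the points only)."""
    q1, q2, q3 = Q
    p1, p2, p3 = Qp
    r1, r2, r3 = Qpp
    return [q1 * p1 * r3 + q2 * p2 * r3, q1 * p2 * r3 - q2 * p1 * r3,
            q1 * p3 * r1 + q2 * p3 * r2, q1 * p3 * r2 - q2 * p3 * r1,
            q3 * p1 * r1 + q3 * p2 * r2, q3 * p1 * r2 - q3 * p2 * r1,
            q1 * p3 * r3, q2 * p3 * r3, q3 * p1 * r3, q3 * p2 * r3,
            q3 * p3 * r1, q3 * p3 * r2, q3 * p3 * r3]
```

Each pixel contributes one row `products_C(Q, Qp, Qpp)` to a homogeneous
linear system in the 13 unknowns, which is why 12 pixels determine the tensor
up to scale.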

Given triplets of points Q, Q' and Q" for at least 12 pixels, we may compute
the trilinear tensor up to an unknown scale $\lambda$ by solving a system of linear
equations of type (2). Note that we have verified using simulated data that
we indeed can obtain a unique solution (up to scale) for the tensor. The main
problem is then that of extracting the motion parameters from the calibration
tensor. In [21] we give a simple algorithm for doing so (see footnote 2). Once the
motions are determined, the approach described above can be readily applied to
compute the camera rays and thus to finalize the calibration.

The special case of central cameras. It is worthwhile to specialize the calibration
concept to the case of central cameras (but which are otherwise general, i.e. not
perspective). A central camera can already be calibrated from two views. Let Z
be the homogeneous coordinates of the optical center (in the frame associated
with the object’s first position). We have the following collinearity constraint:

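\[
\begin{vmatrix}
Z_1 & Q_1 & R'_{11}Q'_1 + R'_{12}Q'_2 + t'_1 Q'_3 \\
Z_2 & Q_2 & R'_{21}Q'_1 + R'_{22}Q'_2 + t'_2 Q'_3 \\
Z_3 & Q_3 & Q'_3
\end{vmatrix}
= Q'^T
\begin{pmatrix}
R'_{21}Z_3 & -R'_{22}Z_3 & R'_{22}Z_2 - R'_{21}Z_1 \\
R'_{22}Z_3 & R'_{21}Z_3 & -R'_{21}Z_2 - R'_{22}Z_1 \\
Z_3 t'_2 - Z_2 & Z_1 - Z_3 t'_1 & Z_2 t'_1 - Z_1 t'_2
\end{pmatrix}
Q = 0
\]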

The bifocal calibration tensor in this equation is a 3 x 3 matrix and somewhat 
similar to a fundamental or essential matrix. It can be estimated linearly from 
calibration points associated with 8 pixels or more. It is of rank 2 and its right 
null vector is the optical center Z, which is thus easy to compute. Once this is 
done, the camera ray for a pixel can be determined e.g. by joining Z and Q. 
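
The properties of the bifocal calibration tensor stated above can be checked
on synthetic data. The sketch below builds the matrix as $F = M^T [Z]_\times$,
where M is the 3 x 3 homogeneous motion matrix of the second view and
$[Z]_\times$ the skew-symmetric matrix of Z; this factorization is our own way
of writing the determinant and the function names are hypothetical.

```python
def skew(z):
    """Skew-symmetric matrix [z]x such that [z]x q is the cross product z x q."""
    return [[0.0, -z[2], z[1]], [z[2], 0.0, -z[0]], [-z[1], z[0], 0.0]]


def bifocal_tensor_2d(R, t, Z):
    """3x3 matrix F with Q'^T F Q = det[Z, Q, M Q'] for M = [[R, t], [0, 1]]."""
    M = [[R[0][0], R[0][1], t[0]], [R[1][0], R[1][1], t[1]], [0.0, 0.0, 1.0]]
    S = skew(Z)
    # F = M^T S, written out entry by entry.
    return [[sum(M[k][i] * S[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def bilinear(F, Qp, Q):
    """Evaluate Q'^T F Q."""
    return sum(Qp[i] * F[i][j] * Q[j] for i in range(3) for j in range(3))


def times(F, Z):
    """Matrix-vector product F Z."""
    return [sum(F[i][j] * Z[j] for j in range(3)) for i in range(3)]
```

For any pair Q, Q' seen by the same pixel, `bilinear(F, Qp, Q)` vanishes, and
`times(F, Z)` is the zero vector, which is how Z is read off as the right null
vector once F has been estimated.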

The special case of a linear calibration object. This case is equally worth
investigating. We propose an algorithm in [21], which works but is more
complicated than the algorithm for general calibration objects.



4 Generic Calibration Concept for 3D Cameras 



This and the next section describe our main contributions. We extend the
concept described in §3 to the case of cameras living in 3-space. We first deal with
the most general case: non-central cameras and 3D calibration objects.

In case of known motion, two views are sufficient to calibrate, and the
procedure is equivalent to that outlined in §3, cf. [8]. In the following, we consider
the practical case of unknown motion. The input now consists, for each pixel, of
three 3D points Q, Q' and Q", given by 4-vectors of homogeneous coordinates,
relative to the calibration object’s local coordinate system. Again, we adopt the
coordinate system associated with the first image as global coordinate frame. The
object’s motion for the other two images is given by 3 x 3 rotation matrices R' and
R" and translation vectors t' and t". With the correct motion estimates, the aligned
points must be collinear. We stack their coordinates in the following 4 x 3 matrix:



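\[
\begin{pmatrix}
Q_1 & R'_{11}Q'_1 + R'_{12}Q'_2 + R'_{13}Q'_3 + t'_1 Q'_4 & R''_{11}Q''_1 + R''_{12}Q''_2 + R''_{13}Q''_3 + t''_1 Q''_4 \\
Q_2 & R'_{21}Q'_1 + R'_{22}Q'_2 + R'_{23}Q'_3 + t'_2 Q'_4 & R''_{21}Q''_1 + R''_{22}Q''_2 + R''_{23}Q''_3 + t''_2 Q''_4 \\
Q_3 & R'_{31}Q'_1 + R'_{32}Q'_2 + R'_{33}Q'_3 + t'_3 Q'_4 & R''_{31}Q''_1 + R''_{32}Q''_2 + R''_{33}Q''_3 + t''_3 Q''_4 \\
Q_4 & Q'_4 & Q''_4
\end{pmatrix}
\qquad (3)
\]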

2 This is similar to, though more complicated than, extracting (ego-)motion of
perspective cameras from the classical essential matrix [9].
The collinearity constraint means that this matrix must be of rank less than
3, which implies that all sub-determinants of size 3x3 vanish. There are 4 of 
them, obtained by leaving out one row at a time. Each of these corresponds to a 
trilinear equation in point coordinates and thus to a trifocal calibration tensor 
whose coefficients depend on the motion parameters. 
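
The rank condition can be made concrete with a small helper that returns the
four 3 x 3 sub-determinants of the 4 x 3 matrix. This is only a sketch; the
function names are ours, and the three input points are assumed to be already
mapped into the common frame.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def subdeterminants(points):
    """Four 3x3 sub-determinants of the 4x3 matrix whose columns are the
    three homogeneous 4-vectors in `points`; entry k leaves out row k."""
    rows = [[p[r] for p in points] for r in range(4)]
    return [det3([rows[r] for r in range(4) if r != skip])
            for skip in range(4)]


def are_collinear(points, tol=1e-9):
    """Three 3D points are collinear iff all four sub-determinants vanish."""
    return all(abs(d) <= tol for d in subdeterminants(points))
```

Because the test is homogeneous, rescaling any of the three 4-vectors does not
change the outcome.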

Table 2 gives the coefficients of the first two calibration tensors (all 4 are given
in the appendix of [21]). For both, 34 out of 64 coefficients are always zero. One
may observe that the two tensors share some coefficients, e.g.
$V_8 = W_1 = R'_{31}$.

The tensors can be estimated by solving linear equation systems, and we
verified using simulated random experiments that in general unique solutions
(up to scale) are obtained, if 3D points for sufficiently many pixels (29 at least)
are available. In the following, we give an algorithm for computing the motion
parameters. Let $V'_i = \lambda V_i$ and $W'_i = \mu W_i$, $i = 1 \ldots 37$, be
the estimated tensors (up to scale). The algorithm proceeds as follows.

1. Estimate scale factors: $\lambda = \sqrt{V'^2_8 + V'^2_9 + V'^2_{10}}$ and
$\mu = \sqrt{W'^2_1 + W'^2_2 + W'^2_3}$.

2. Compute $V_i = V'_i / \lambda$ and $W_i = W'_i / \mu$, $i = 1 \ldots 37$.

3. Compute R' and R":

\[
R' = \begin{pmatrix}
-W_{15} & -W_{16} & -W_{17} \\
-V_{15} & -V_{16} & -V_{17} \\
V_8 & V_9 & V_{10}
\end{pmatrix}
\qquad
R'' = \begin{pmatrix}
W_{18} & W_{19} & W_{20} \\
V_{18} & V_{19} & V_{20} \\
-V_{11} & -V_{12} & -V_{13}
\end{pmatrix}
\]

They will not be orthonormal in general. We “correct” this as shown in [21].

4. Compute t' and t" by solving a straightforward linear least squares problem,
which is guaranteed to have a unique solution, see [21] for details.
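
Steps 1 to 3 can be sketched as follows. This is a minimal illustration, not
the implementation of [21]: the tensors are passed as dictionaries indexed
1...37 as in table 2, the function name is hypothetical, the estimated scale
is assumed to be positive (a sign flip of an estimated tensor would have to be
resolved separately), and the orthonormality correction and step 4 are
omitted.

```python
import math


def rotations_from_tensors(V_est, W_est):
    """Recover R' and R'' from two trifocal calibration tensors known up to
    (positive) scale. V_est, W_est: dicts {index: value}, indices as in table 2."""
    # Step 1: entries 8-10 of V and 1-3 of W hold the third row of R',
    # which has unit norm, so their norm is the unknown scale.
    lam = math.sqrt(V_est[8] ** 2 + V_est[9] ** 2 + V_est[10] ** 2)
    mu = math.sqrt(W_est[1] ** 2 + W_est[2] ** 2 + W_est[3] ** 2)

    # Step 2: remove the scale.
    V = {i: v / lam for i, v in V_est.items()}
    W = {i: w / mu for i, w in W_est.items()}

    # Step 3: read the rotation entries off the normalized tensors.
    R_prime = [[-W[15], -W[16], -W[17]],
               [-V[15], -V[16], -V[17]],
               [V[8], V[9], V[10]]]
    R_second = [[W[18], W[19], W[20]],
                [V[18], V[19], V[20]],
                [-V[11], -V[12], -V[13]]]
    return R_prime, R_second
```

With noisy input the two returned matrices are only approximately
orthonormal, which is why a correction step follows in the full algorithm.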

Using simulations, we verified that the algorithm gives a unique and correct 
solution in general. 



5 Variants of the Calibration Concept 

Analogously to the case of 2D cameras, cf. §3, we developed important
specializations of our calibration concept, for central cameras and planar
calibration objects. We describe them very briefly; details are given in [21].



Central cameras. In this case, two images are sufficient. Let Z be the optical
center (unknown). By proceeding as in §3, we obtain 4 bifocal calibration tensors
of size 4 x 4 and rank 2, that are somewhat similar to fundamental matrices.
One of them is shown here:



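\[
\begin{pmatrix}
0 & 0 & 0 & 0 \\
R'_{31}Z_4 & R'_{32}Z_4 & R'_{33}Z_4 & -Z_3 + Z_4 t'_3 \\
-R'_{21}Z_4 & -R'_{22}Z_4 & -R'_{23}Z_4 & Z_2 - Z_4 t'_2 \\
R'_{21}Z_3 - R'_{31}Z_2 & R'_{22}Z_3 - R'_{32}Z_2 & R'_{23}Z_3 - R'_{33}Z_2 & Z_3 t'_2 - Z_2 t'_3
\end{pmatrix}
\]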



It is relatively straightforward to extract the motion parameters and the optical 
center from these tensors. 
Table 2. Coefficients of two trifocal calibration tensors for a general 3D camera.

  i    C_i           V_i                        W_i
  1    Q1 Q'1 Q"4    0                          R'31
  2    Q1 Q'2 Q"4    0                          R'32
  3    Q1 Q'3 Q"4    0                          R'33
  4    Q1 Q'4 Q"1    0                          -R"31
  5    Q1 Q'4 Q"2    0                          -R"32
  6    Q1 Q'4 Q"3    0                          -R"33
  7    Q1 Q'4 Q"4    0                          t'3 - t"3
  8    Q2 Q'1 Q"4    R'31                       0
  9    Q2 Q'2 Q"4    R'32                       0
 10    Q2 Q'3 Q"4    R'33                       0
 11    Q2 Q'4 Q"1    -R"31                      0
 12    Q2 Q'4 Q"2    -R"32                      0
 13    Q2 Q'4 Q"3    -R"33                      0
 14    Q2 Q'4 Q"4    t'3 - t"3                  0
 15    Q3 Q'1 Q"4    -R'21                      -R'11
 16    Q3 Q'2 Q"4    -R'22                      -R'12
 17    Q3 Q'3 Q"4    -R'23                      -R'13
 18    Q3 Q'4 Q"1    R"21                       R"11
 19    Q3 Q'4 Q"2    R"22                       R"12
 20    Q3 Q'4 Q"3    R"23                       R"13
 21    Q3 Q'4 Q"4    t"2 - t'2                  t"1 - t'1
 22    Q4 Q'1 Q"1    R'21 R"31 - R'31 R"21      R'11 R"31 - R'31 R"11
 23    Q4 Q'1 Q"2    R'21 R"32 - R'31 R"22      R'11 R"32 - R'31 R"12
 24    Q4 Q'1 Q"3    R'21 R"33 - R'31 R"23      R'11 R"33 - R'31 R"13
 25    Q4 Q'1 Q"4    R'21 t"3 - R'31 t"2        R'11 t"3 - R'31 t"1
 26    Q4 Q'2 Q"1    R'22 R"31 - R'32 R"21      R'12 R"31 - R'32 R"11
 27    Q4 Q'2 Q"2    R'22 R"32 - R'32 R"22      R'12 R"32 - R'32 R"12
 28    Q4 Q'2 Q"3    R'22 R"33 - R'32 R"23      R'12 R"33 - R'32 R"13
 29    Q4 Q'2 Q"4    R'22 t"3 - R'32 t"2        R'12 t"3 - R'32 t"1
 30    Q4 Q'3 Q"1    R'23 R"31 - R'33 R"21      R'13 R"31 - R'33 R"11
 31    Q4 Q'3 Q"2    R'23 R"32 - R'33 R"22      R'13 R"32 - R'33 R"12
 32    Q4 Q'3 Q"3    R'23 R"33 - R'33 R"23      R'13 R"33 - R'33 R"13
 33    Q4 Q'3 Q"4    R'23 t"3 - R'33 t"2        R'13 t"3 - R'33 t"1
 34    Q4 Q'4 Q"1    R"31 t'2 - R"21 t'3        R"31 t'1 - R"11 t'3
 35    Q4 Q'4 Q"2    R"32 t'2 - R"22 t'3        R"32 t'1 - R"12 t'3
 36    Q4 Q'4 Q"3    R"33 t'2 - R"23 t'3        R"33 t'1 - R"13 t'3
 37    Q4 Q'4 Q"4    t'2 t"3 - t'3 t"2          t'1 t"3 - t'3 t"1



Non-central cameras and planar calibration objects. The algorithm for this case
is rather more complicated and not shown here. Using simulations, we verified
that we obtain a unique solution in general.

Central cameras and planar calibration objects. As with non-central cameras,
we already obtain constraints on the motion parameters (and the optical center)
from two views of the planar object. In this case however, the associated
calibration tensors do not contain sufficient information to uniquely estimate
the motion and optical center. This is not surprising: even in the very restricted
case of perspective cameras with 5 intrinsic parameters, two views of a planar
calibration object do not suffice for calibration [20,26]. We thus developed an
algorithm working with three views [21]. It is rather complicated, but was shown
to provide unique solutions in general.

6 Experimental Evaluation 

As mentioned previously, we verified each algorithm using simulated random
experiments. This was first done using noiseless data. We also tested our methods
using noisy data and obtained satisfactory results. A detailed quantitative
analysis remains to be carried out.

We did various experiments with real images, using a 3M-Pixel digital camera
with moderate optical distortions, a camera with a fish-eye lens and
“home-made” catadioptric systems consisting of a digital camera and various
curved off-the-shelf mirrors. We used planar calibration objects consisting of
black dots or squares on white paper. Figure 3 shows three views taken by the
digital camera.






Fig. 3. Top: images of 3 boards of different sizes, captured by a digital camera. Bottom: 
two views of the calibrated camera rays and estimated pose of the calibration boards. 



Dots/corners were extracted using the Harris detector. Matching of these 
image points to points on calibration objects was done semi-automatically. This 
gives calibration points for a sparse set of pixels per image, and in general there 
will be few, if any, pixels for which we get a calibration point in every view! 
We thus take into account the continuity assumption mentioned in §2. For every 
image, we compute the convex hull of the pixels for which calibration points were 
extracted. We then compute the intersection of the convex hulls over all three 
views, and henceforth only consider pixels inside that region. For every such 
pixel in the first image we estimate the calibration points for the second and 
third images using the following interpolation scheme: in each of these images, 
we determine the 4 closest extracted calibration points. We then compute the 
homography between these pixels and the associated calibration points on the 
planar object. The calibration point for the pixel of interest is then computed 
using that homography. 
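
The interpolation step can be sketched as below: a homography is computed from
four pixel/calibration-point pairs and then applied to the pixel of interest.
This is only an illustration under our own assumptions: the function names are
hypothetical, the homography is normalized so that its last entry is 1, the
four points are assumed to be in general position (no three collinear), and
the selection of the 4 closest extracted points is left out.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def homography_from_4_points(pixels, targets):
    """3x3 homography H (with H[2][2] = 1) mapping each pixel (x, y)
    to its calibration point (u, v) on the planar object."""
    A, b = [], []
    for (x, y), (u, v) in zip(pixels, targets):
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve(A, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_homography(H, p):
    """Map a 2D point through H and dehomogenize."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

The calibration point of an in-between pixel is then
`apply_homography(H, pixel)`, where H is computed from its four nearest
extracted neighbors.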

On applying the algorithm for central cameras (cf. §5), we obtained the 
results shown in figure 3. The bottom row shows the calibrated camera rays and 
the pose of the calibration objects, given by the estimated motion parameters. It 
is difficult to evaluate the calibration quantitatively, but we observe that for every 
pixel considered, the estimated motion parameters give rise to nearly perfectly
collinear calibration points. Note also, cf. the bottom right figure, that radial 
distortion is correctly modeled: the camera rays are setwise coplanar, although 
the corresponding sets of pixels in the image are not perfectly collinear. 

The same experiment was performed for a fish-eye lens, cf. figure 4. The result 
is slightly worse - aligned calibration points are not always perfectly collinear. 
This experiment is preliminary in that only the central image region has been 
calibrated (cf. figure 4), due to the difficulty of placing planar calibration objects 
that cover the whole field of view. 




Fig. 4. Left: one of 3 images taken by the fish-eye lens (in white the area that was 
calibrated). Middle: calibrated camera rays and estimated pose of calibration objects. 
Right: image from the left after distortion correction, see text. 



Using the calibration information, we carried out two sample applications, as
described in the following. The first one consists in correcting non-perspective
distortions: calibration of the central camera model gives us a bundle of rays
passing through a single point. We may cut these rays by a plane; at each
intersection with a camera ray, we “paint” the plane with the “color” observed by
the pixel associated with the ray in some input image. Using the same
homography-based interpolation scheme as above, we can thus create a “densely”
colored plane, which is nothing else than the image plane of a distortion-corrected
perspective image. See figure 4 for an example. This model-free distortion
correction scheme is somewhat similar to the method proposed in [5].
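
The ray-cutting step can be sketched as follows. The example assumes, purely
for illustration, that the cutting plane is z = const in the frame of the
calibrated rays and that each ray is stored as a direction from the optical
center; the function names are hypothetical and the subsequent dense
interpolation is omitted.

```python
def intersect_with_plane(center, direction, plane_z):
    """Intersect the half-ray center + s * direction (s > 0) with the plane
    z = plane_z. Returns (x, y) on the plane, or None if there is no hit."""
    dz = direction[2]
    if abs(dz) < 1e-12:          # ray parallel to the plane
        return None
    s = (plane_z - center[2]) / dz
    if s <= 0.0:                 # plane lies behind the optical center
        return None
    return (center[0] + s * direction[0], center[1] + s * direction[1])


def paint_plane(ray_directions, colors, center, plane_z):
    """Collect sparse (position, color) samples on the cutting plane.

    ray_directions: {pixel: direction of the calibrated ray}
    colors:         {pixel: color observed by that pixel in an input image}
    """
    samples = []
    for pixel, direction in ray_directions.items():
        hit = intersect_with_plane(center, direction, plane_z)
        if hit is not None:
            samples.append((hit, colors[pixel]))
    return samples
```

Rays that never reach the plane (for example those of a fish-eye lens pointing
sideways) are simply skipped, which limits the field of view of the corrected
perspective image.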

Another application concerns (ego-)motion and epipolar geometry estimation.
Given calibration information, we can estimate relative camera pose (or
motion), and thus epipolar geometry, from two or more views of an unknown
object. We developed a motion estimation method similar to [15] and applied
it to two views taken by the fish-eye lens. The epipolar geometry of the two
views can be computed and visualized as follows: for a pixel in the first view, we
consider its camera ray and determine all pixels of the second view whose rays
(approximately) intersect the first ray. These pixels form the “epipolar curve”
associated with the original pixel. An example is shown in figure 5. The
estimated calibration and motion of course also allow objects to be reconstructed
in 3D (see [21] for examples).
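
The search for an epipolar curve can be sketched as a distance test between 3D
lines. The sketch assumes that the rays of the second view have already been
transformed into the frame of the first view using the estimated motion; the
function names and the tolerance parameter are our own.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return _dot(a, a) ** 0.5


def line_distance(p1, d1, p2, d2):
    """Shortest distance between two 3D lines given as point + direction."""
    n = _cross(d1, d2)
    w = _sub(p2, p1)
    if _norm(n) < 1e-12:                      # parallel lines
        return _norm(_cross(w, d1)) / _norm(d1)
    return abs(_dot(w, n)) / _norm(n)


def epipolar_curve(ray, rays_second_view, tol):
    """Pixels of the second view whose rays pass within tol of `ray`.

    ray:              (point, direction) of the pixel in the first view
    rays_second_view: {pixel: (point, direction)}, in the first view's frame
    """
    p, d = ray
    return sorted(pixel for pixel, (q, e) in rays_second_view.items()
                  if line_distance(p, d, q, e) <= tol)
```

The tolerance has to be chosen relative to the scene scale and the calibration
accuracy; with a central model all returned curves pass through the image of
the other view's optical center.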
Fig. 5. Epipolar curves for three points. These are not straight lines, but intersect in
a single point, since we here use the central camera model. 



7 Discussion 

The algorithm for central cameras seems to work fine, even with the minimum
input of 3 views and a planar calibration object. Experiments with non-central
catadioptric cameras, however, have so far not given satisfactory results. One
reason for the poor stability of the non-central method is the way we currently
obtain our input (homography-based interpolation of calibration points). We also
think that the general algorithm, which is essentially based on solving linear
equations, can only give stable results with minimum input (3 views) if the
considered camera is clearly non-central. By this, we mean that there is not any
point that is “close” to all camera rays; the general algorithm does not work for
perspective cameras, but does work for multi-stereo systems consisting of
sufficiently many cameras (see footnote 3).

We propose several ideas for overcoming these problems. Most importantly,
we probably need to use several to many images for a stable calibration. We have
developed bundle adjustment formulations for our calibration problem, which is
not straightforward: the camera model is of discrete nature and does not directly
allow sub-pixel image coordinates to be handled, which are for example needed in
derivatives of a reprojection error based cost function. For initialization of the
non-central bundle adjustment, we may use the (more stable) calibration results
for the central model. Model selection may be applied to determine if the central
or non-central model is more appropriate for a given camera. Another way of
stabilizing the calibration might be the inclusion of constraints on the
set of camera rays, such as rotational or planar symmetry, if appropriate.

Although we have a single algorithm that works for nearly all existing
camera types, different cameras will likely require different designs of calibration
objects, e.g. panoramic cameras vs. ones with narrow field of view. We stress
that a single calibration can use images of different calibration objects; in our
experiments, we actually use planar calibration objects of different sizes for the
different views, imaged from different distances, cf. figure 3. This way, we can

3 Refer to the appendix of [21] on the feasibility of the general calibration method for 
stereo systems consisting of three or more central cameras. 
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place them such that they do not “intersect” in space, which would give less 
stable results, especially for camera rays passing close to the intersection region. 
We also plan to use different calibration objects for initialization and bundle 
adjustment: initialization, at least for the central model, can be performed using 
the type of calibration object used in this work. As for bundle adjustment, we 
might then switch to objects with a much denser “pattern” e.g. with a coating 
consisting of randomly distributed colored speckles. Another possibility is to use 
a flat screen to produce a dense set of calibration points [8] . 

One comment on the difference between calibration and motion estimation:
here, with 3 views of a known scene, we solve simultaneously for motion and
calibration (motion is determined explicitly, calibration implicitly), whereas once
a (general) camera is calibrated, (ego-)motion can already be estimated from 2
views of an unknown scene [15]. Hence, although our method estimates motion
directly, we consider it a calibration method.

8 Conclusions 

We have proposed a theory and algorithms for a highly general calibration concept.
For now, we consider this mainly a conceptual contribution: we have
shown how to calibrate nearly any camera, using one and the same algorithm.

We already propose specializations that may be important in practice: an
algorithm for central, though otherwise unconstrained, cameras is presented, as
well as an algorithm for the use of planar calibration objects. Results of
preliminary experiments demonstrate that the approach allows central
cameras to be calibrated without using any parametric distortion model.

We believe in our concept’s potential for calibrating cameras with “exotic”
distortions, such as fish-eye lenses with hemispheric field of view or catadioptric
cameras, especially non-central ones. We are working towards that goal,
by developing bundle adjustment procedures to calibrate from multiple images, 
and by designing better calibration objects. These issues could bring about the 
necessary stability to really calibrate cameras without any parametric model in 
practice. Other ongoing work concerns the extension of classical structure-from- 
motion tasks such as motion and pose estimation and triangulation, from the 
perspective to the general imaging model. 

Abstract. We present a General Linear Camera (GLC) model that unifies
many previous camera models into a single representation. The GLC
model is capable of describing all perspective (pinhole), orthographic,
and many multiperspective (including pushbroom and two-slit) cameras,
as well as epipolar plane images. It also includes three new and
previously unexplored multiperspective linear cameras. Our GLC model
is both general and linear in the sense that, given any vector space
where rays are represented as points, it describes all 2D affine subspaces
(planes) that can be formed by affine combinations of 3 rays. The
incident radiance seen along the rays found on subregions of these 2D
affine subspaces is a precise definition of a projected image of a 3D
scene. The GLC model also provides an intuitive physical interpretation,
which can be used to characterize real imaging systems. Finally, since the
GLC model provides a complete description of all 2D affine subspaces,
it can be used as a tool for first-order differential analysis of arbitrary
(higher-order) multiperspective imaging systems.



1 Introduction 

Camera models are fundamental to the fields of computer vision and
photogrammetry. The classic pinhole and orthographic camera models have long
served as the workhorse of 3D imaging applications. However, recent developments
have suggested alternative multiperspective camera models [4,20] that provide
alternate and potentially advantageous imaging systems for understanding the
structure of observed scenes. Researchers have also recently shown that these
multiperspective cameras are amenable to stereo analysis and interpretation [13,
11, 20].

In contrast to pinhole and orthographic cameras, which can be completely
characterized using a simple linear model (the classic 3 by 4 matrix [5]),
multiperspective camera models are defined less precisely. In practice,
multiperspective camera models are described by constructions. By this we mean
that a system or process is described for generating each specific class. While
such physical models are useful for both acquisition and imparting intuition,
they are not particularly amenable to analysis.



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3022, pp. 14-27, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

In this paper we present a unified General Linear Camera (GLC) model that
is able to describe nearly all useful imaging systems. In fact, under an appropriate
interpretation, it describes all possible linear images. In doing so it provides a
single model that unifies existing perspective and multiperspective cameras.

2 Previous Work 

The most common linear camera model is the classic 3x4 pinhole camera matrix
[5], which combines six extrinsic and five intrinsic camera parameters into a single
operator that maps homogeneous 3D points to a 2D image plane. These mappings
are unique up to a scale factor, and the same infrastructure can also
be used to describe orthographic cameras. Recently, several researchers have
proposed alternative camera representations known as multiperspective cameras,
which capture rays from different points in space. These multiperspective cameras
include pushbroom cameras [4], which collect rays along parallel planes from
points swept along a linear trajectory, and two-slit cameras [10], which collect
all rays passing through two lines. Zomet et al. [20] carried out an extensive
analysis and modelling of two-slit (XSlit) multiperspective cameras. However, they
discuss the relationship of these cameras to pinhole cameras only for the purpose
of image construction, whereas we provide a unifying model.

Multiperspective imaging has also been explored in the field of computer 
graphics. Examples include multiple-center-of-projection images [12], manifold 
mosaics [11], and multiperspective panoramas [18]. Most multiperspective images 
are generated by stitching together parts of pinhole images [18,12], or slicing 
through image sequences [11,20]. 





Fig. 1. General Linear Camera Model. (a) A GLC is characterized by three rays
originating from the image plane. (b) It collects all possible affine combinations
of the three rays.



Seitz [13] has analyzed the space of multiperspective cameras to determine
those with a consistent epipolar geometry. Their work suggests that some
multiperspective images can be used to analyze three-dimensional structure, just as
pinhole cameras are commonly used. We focus our attention on a specific class
of linear multiperspective cameras, most of which satisfy Seitz’s criterion.

Our analysis is closely related to the work of Gu et al. [3], which explored the
linear structures of 3D rays under a particular 4D mapping known as a two-plane 
parametrization. This model is commonly used for light field rendering. Their 
primary focus was on the duality of points and planes under this mapping. They 
deduced that XSlits are another planar structure within this space, but they do 
not characterize all of the possible planar structures, nor discuss their analogous 
camera models. 

Our new camera model only describes the set of rays seen by a particular 
camera, not their distribution on the image plane. Under this definition pinhole 
cameras are defined by only 3 parameters (the position of the pinhole in 3D). 
Homographies and other non-linear mappings of pinhole images (e.g., radial
distortion) only change the distribution of rays in the image plane, but do not
change the set of rays seen. Therefore, all such mappings are equivalent under 
our model. 



3 General Linear Camera Model 

The General Linear Camera (GLC) is defined by three rays that originate from
three points $p_1(u_1, v_1)$, $p_2(u_2, v_2)$ and $p_3(u_3, v_3)$ on an image plane
$\Pi_{image}$, as is shown in Figure 1. A GLC collects radiance measurements along
all possible “affine combinations” of these three rays. In order to define this
affine combination of rays, we assume a specific ray parametrization.

W.l.o.g., we define $\Pi_{image}$ to lie on the $z = 0$ plane and its origin to coincide
with the origin of the coordinate system. From now on, we refer to $\Pi_{image}$ as
$\Pi_{uv}$. In order to parameterize rays, we place a second plane $\Pi_{st}$ at $z = 1$.
All rays not parallel to $\Pi_{st}$ and $\Pi_{uv}$ will intersect the two planes at
$(s, t, 1)$ and $(u, v, 0)$ respectively. That gives a 4D parametrization of each ray
in the form $(s, t, u, v)$. This parametrization for rays, called the two-plane
parametrization (2PP), is widely used by the computer graphics community for
representing light fields and lumigraphs [7,2]. Under this parametrization, an
affine combination of three rays $r_i(s_i, t_i, u_i, v_i)$, $i = 1, 2, 3$, is defined as:

$$r = \alpha \cdot (s_1, t_1, u_1, v_1) + \beta \cdot (s_2, t_2, u_2, v_2) + (1 - \alpha - \beta) \cdot (s_3, t_3, u_3, v_3) \qquad (1)$$
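
As an illustration of equation (1), the following minimal sketch forms an affine combination of three rays stored as $(s, t, u, v)$ tuples and evaluates the 3D point where a ray crosses a given depth. The function names `affine_combination` and `point_on_ray` are hypothetical and chosen only for this example; the three sample rays are assumed to pass through the point (0, 0, 2).

```python
from typing import Tuple

# A ray in the two-plane parametrization: it crosses z = 1 at (s, t)
# and z = 0 at (u, v).
Ray = Tuple[float, float, float, float]


def affine_combination(r1: Ray, r2: Ray, r3: Ray, alpha: float, beta: float) -> Ray:
    """Equation (1): alpha*r1 + beta*r2 + (1 - alpha - beta)*r3, componentwise."""
    gamma = 1.0 - alpha - beta
    return tuple(alpha * a + beta * b + gamma * c for a, b, c in zip(r1, r2, r3))


def point_on_ray(ray: Ray, z: float) -> Tuple[float, float, float]:
    """3D point where the ray crosses the plane at depth z.

    The ray passes through (u, v, 0) and (s, t, 1), so the point at depth z
    is z*(s, t, 1) + (1 - z)*(u, v, 0).
    """
    s, t, u, v = ray
    return (z * s + (1.0 - z) * u, z * t + (1.0 - z) * v, z)


# Three sample rays through the point (0, 0, 2) (illustrative values only).
r1 = (0.0, 0.0, 0.0, 0.0)
r2 = (0.5, 0.0, 1.0, 0.0)
r3 = (0.0, 0.5, 0.0, 1.0)
r = affine_combination(r1, r2, r3, 0.25, 0.25)
print(r, point_on_ray(r, 2.0))
```

For these sample values the combined ray again passes through (0, 0, 2), which is the behaviour the pinhole discussion later in the paper relies on.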

The choice of $\Pi_{st}$ at $z = 1$ is, of course, arbitrary. One can choose any
plane parallel to $\Pi_{uv}$ to derive an equivalent parametrization. Moreover, these
alternate parameterizations will preserve affine combinations of three rays.

Lemma 1. The affine combinations of any three rays under two different 2PP
parameterizations that differ by choice of $\Pi_{st}$ (i.e., $(s, t, u, v)$ and
$(s', t', u, v)$) are the same.

Proof. Suppose $\Pi_{s't'}$ is at some arbitrary depth $z_0$, $z_0 \neq 0$. Consider the
transformation of a ray between the default parametrization ($z_0 = 1$) and this new one.
If $r(s, t, u, v)$ and $r(s', t', u, v)$ represent the same ray $r$ in 3D, then $r(s, t, u, v)$
must pass through $(s', t', z_0)$, and there must exist some $\lambda$ such that

$$\lambda \cdot (s, t, 1) + (1 - \lambda) \cdot (u, v, 0) = (s', t', z_0) \qquad (2)$$

Solving for $\lambda$ (the third coordinate gives $\lambda = z_0$), we have

$$s' = s \cdot z_0 + u \cdot (1 - z_0), \quad t' = t \cdot z_0 + v \cdot (1 - z_0) \qquad (3)$$

Since this transformation is linear, and affine combinations are preserved under
linear transformation, the affine combinations of rays under our default two-plane
parametrization ($z_0 = 1$) will be consistent for parameterizations over
alternative parallel planes. Moreover, the affine weights for a particular choice
of parallel $\Pi_{st}$ are general. □
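
Lemma 1 can be checked numerically with a short sketch, assuming rays are stored as $(s, t, u, v)$ tuples. The helper names (`reparametrize`, `lemma1_gap`) and the sample values are hypothetical; the code only compares "combine, then reparametrize" against "reparametrize, then combine" using equation (3).

```python
def reparametrize(ray, z0):
    """Equation (3): move the second plane from z = 1 to z = z0."""
    s, t, u, v = ray
    return (s * z0 + u * (1.0 - z0), t * z0 + v * (1.0 - z0), u, v)


def affine_combination(r1, r2, r3, alpha, beta):
    """Equation (1), componentwise on (s, t, u, v) tuples."""
    gamma = 1.0 - alpha - beta
    return tuple(alpha * a + beta * b + gamma * c for a, b, c in zip(r1, r2, r3))


def lemma1_gap(r1, r2, r3, alpha, beta, z0):
    """Largest componentwise difference between the two orders of operation.

    Lemma 1 predicts zero, up to floating-point rounding.
    """
    combined_first = reparametrize(affine_combination(r1, r2, r3, alpha, beta), z0)
    moved = [reparametrize(r, z0) for r in (r1, r2, r3)]
    moved_first = affine_combination(moved[0], moved[1], moved[2], alpha, beta)
    return max(abs(x - y) for x, y in zip(combined_first, moved_first))


# Arbitrary sample rays and weights (illustrative values only).
rays = [(0.3, -1.2, 0.0, 0.5), (2.0, 0.7, 1.0, -0.4), (-0.6, 1.5, 0.2, 1.1)]
print(lemma1_gap(rays[0], rays[1], rays[2], alpha=0.3, beta=0.5, z0=2.5))
```

The printed gap should be at the level of rounding error for any choice of rays, weights and nonzero depth.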

We call the GLC model “linear” because it defines all 2-dimensional affine
subspaces in the 4-dimensional “ray space” imposed by a two-plane parametrization.
Moreover, these 2D affine subspaces of rays can be considered as images.
We refer to the three rays used in a particular GLC as the GLC’s generator
rays. Equivalently, a GLC can be described by the coordinates of two triangles
with corresponding vertices, one located on $\Pi_{st}$, and the second on $\Pi_{uv}$. Unless
otherwise specified, we will assume the three generator rays (in their 4D
parametrization) are linearly independent. This affine combination of generator rays
also preserves linearity, while other parameterizations, such as the 6D Plücker
coordinates [16], do not [3].

Lemma 2. If three rays are parallel to a plane $\Pi$ in 3D, then all affine
combinations of them are parallel to $\Pi$ as well.

Lemma 3. If three rays intersect a line l parallel to the image plane, all affine 
combinations of them will intersect l as well. 

Proof. By Lemma 1, we can reparametrize three rays by placing $\Pi_{st}$ so that it
contains l, resulting in the same set of affine combinations of the three rays.
Because the st plane intersections of the three rays will lie on l, all affine
combinations of three rays will have their st coordinates on l, i.e., they will all pass
through l. The same argument can be applied to all rays which pass through a
given point. □



4 Equivalence of Classic Camera Models 

Traditional camera models have equivalent GLC representations. 

Pinhole camera: By definition, all rays of a pinhole camera pass through
a single point, C in 3D space (the center of projection). Any three linearly
independent rays from C will then intersect the $\Pi_{uv}$ and $\Pi_{st}$ planes to form two
triangles. These triangles will be similar and have parallel corresponding edges,
as shown in Figure 2(a). Furthermore, any other ray, r, through C will intersect
the $\Pi_{uv}$ and $\Pi_{st}$ planes at points $p_{uv}$ and $q_{st}$. These points will have the
same affine coordinates relative to the triangle vertices on their corresponding
planes, and r has the same affine coordinates as these two points.

Fig. 2. Classic camera models represented as GLC. (a) Two similar triangles on two
planes define a pinhole camera; (b) Two parallel congruent triangles define an
orthographic camera; (c) Three rays from an XSlit camera.

Orthographic camera: By definition, all rays on an orthographic camera 
have the same direction. Any three linearly independent rays from an orthogra- 
phic camera intersect parallel planes at the vertices of congruent triangles with 
parallel corresponding edges, as shown in Figure 2(b). Rays connecting the same 
affine combination of these triangle vertices, have the same direction as the 3 
generator rays, and will, therefore, originate from the same orthographic camera. 

Pushbroom camera: A pushbroom camera sweeps parallel planes along
a line l, collecting those rays that pass through l. We refer to this family of
parallel planes as $\Pi^*$. We choose $\Pi_{uv}$ parallel to l but not containing l, and
select a non-degenerate set of generator rays (they intersect $\Pi_{uv}$ in a triangle).
By Lemmas 2 and 3, all affine combinations of the three rays must lie on $\Pi^*$
parallel planes and must also pass through l and, hence, must belong to the
pushbroom camera. In the other direction, for any point p on $\Pi_{uv}$, there exists
one ray r that passes through p, intersects l and is parallel to $\Pi^*$. Since p must be
some affine combination of the three vertices of the uv triangle, r must lie on
the corresponding GLC. Furthermore, because all rays of the pushbroom camera
will intersect $\Pi_{uv}$, the GLC must generate equivalent rays.

XSlit camera: By definition, an XSlit camera collects all rays that pass
through two non-coplanar lines. We choose $\Pi_{uv}$ to be parallel to both lines
but to not contain either of them. One can then pick a non-degenerate set of
generator rays and find their corresponding triangles on $\Pi_{st}$ and $\Pi_{uv}$. By Lemma
3, all affine combinations of these three rays must pass through both lines and
hence must belong to the XSlit camera. In the other direction, the authors of XSlit
[10,20] have shown that each point p on the image plane $\Pi_{uv}$ maps to a unique
ray r in an XSlit camera. Since p must be some affine combination of the three
vertices of the uv triangle, r must belong to the GLC. The GLC hence must
generate the same rays as the XSlit camera.

Epipolar Plane Image: EPI [1] cameras collect all rays that lie on a plane 
in 3D space. We therefore can pick any three linearly independent rays on the 
plane as generator rays. Affine combinations of these rays generate all possible
rays on the plane, so long as they are linearly independent. Therefore a GLC can 
also represent Epipolar Plane Images. 

5 Characteristic Equation of GLC 

Although we have shown that a GLC can represent most commonly used camera
models, the representation is not unique (i.e., three different generator rays can
define the same camera). In this section we develop a criterion to classify general
linear cameras. One discriminating characteristic of affine ray combinations is
whether or not all rays pass through a line in 3D space. This characteristic is
fundamental to the definition of many multi-perspective cameras. We will use
this criterion to define the characteristic equation of general linear cameras.

Recall that any 2D affine subspace in 4D can be defined as affine combinations 
of three points. Thus, GLC models can be associated with all possible planes 
in the 4D since GLCs are specified as affine combinations of three rays, whose 
duals in 4D are the three points. 

Lemma 4. Given a non-EPI, non-pinhole GLC, if all camera rays pass through
some line l, not at infinity, in 3D space, then l must be parallel to $\Pi_{uv}$.

Proof. We demonstrate the contrapositive. If l is not parallel to $\Pi_{uv}$, and all
rays on a GLC pass through l, then we show the GLC must be either an EPI or
a pinhole camera.

Assume the three rays pass through at least two distinct points on l; otherwise,
they will be on a pinhole camera, by Lemma 3. If l is not parallel to $\Pi_{uv}$, then it
must intersect $\Pi_{st}$ and $\Pi_{uv}$ at some points $(s_0, t_0, 1)$ and $(u_0, v_0, 0)$. Gu et al. [3]
have shown that all rays passing through l must satisfy the following bilinear
constraint:

$$(u - u_0)(t - t_0) - (v - v_0)(s - s_0) = 0 \qquad (4)$$

We show that the only GLCs that satisfy this constraint are EPIs or pinholes.

All 2D affine subspaces in $(s, t, u, v)$ can be written as the intersection of two
linear constraints $A_i \cdot s + B_i \cdot t + C_i \cdot u + D_i \cdot v + E_i = 0$, $i = 1, 2$. In general
we can solve these two equations for two variables; for instance, we can solve for
u-v as

$$u = A'_1 \cdot s + B'_1 \cdot t + E'_1, \quad v = A'_2 \cdot s + B'_2 \cdot t + E'_2 \qquad (5)$$

Substituting u and v into the bilinear constraint (4), we have

$$(A'_1 \cdot s + B'_1 \cdot t + E'_1 - u_0)(t - t_0) = (A'_2 \cdot s + B'_2 \cdot t + E'_2 - v_0)(s - s_0) \qquad (6)$$

This equation can only be satisfied for all s and t if $A'_1 = B'_2$ and $B'_1 = A'_2 = 0$;
therefore, equation (5) can be rewritten as $u = A'_1 \cdot s + E'_1$ and $v = A'_1 \cdot t + E'_2$.
Gu et al. [3] have shown all rays in this form must pass through a 3D point P
(P cannot be at infinity, otherwise all rays have uniform directions and cannot
all pass through any line l, not at infinity). Therefore all rays must lie on a 3D
plane that passes through l and finite P. The only GLC camera in which all rays
lie on a 3D plane is an EPI. If the two linear constraints are singular in u and
v, we can solve for s-t, and similar results hold.

If the two linear constraints cannot be solved for u-v or s-t but can be solved
for u-s or v-t, then a similar analysis results in equations of two parallel lines,
one on $\Pi_{st}$, the other on $\Pi_{uv}$. The set of rays through two parallel lines must
lie on an EPI. □
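
The bilinear constraint (4) used in the proof can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the line l is described by its crossings $(s_0, t_0)$ of $z = 1$ and $(u_0, v_0)$ of $z = 0$, and the helper names `ray_through_line` and `bilinear_residual` are hypothetical.

```python
def ray_through_line(line, depth, direction):
    """Build a ray (s, t, u, v) that meets `line` at the given depth.

    `line` is (s0, t0, u0, v0): it crosses z = 1 at (s0, t0) and z = 0 at (u0, v0).
    `direction` is (dx, dy), the change of the ray in x and y per unit of z.
    """
    s0, t0, u0, v0 = line
    dx, dy = direction
    # Point of the line at the requested depth.
    px = depth * s0 + (1.0 - depth) * u0
    py = depth * t0 + (1.0 - depth) * v0
    # Follow the ray from that point to the planes z = 1 and z = 0.
    return (px + (1.0 - depth) * dx, py + (1.0 - depth) * dy,
            px - depth * dx, py - depth * dy)


def bilinear_residual(ray, line):
    """Left-hand side of equation (4); zero for rays that pass through the line."""
    s, t, u, v = ray
    s0, t0, u0, v0 = line
    return (u - u0) * (t - t0) - (v - v0) * (s - s0)


line = (1.0, 2.0, -0.5, 0.25)               # illustrative values only
ray = ray_through_line(line, 0.4, (3.0, -1.5))
print(bilinear_residual(ray, line))          # expected to be ~0 up to rounding
```

A ray that misses the line, such as (0, 0, 0, 0) for the sample line above, gives a clearly nonzero residual.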

Lemmas 3 and 4 imply that given a GLC, we need only consider if the three
generator rays pass through some line parallel to $\Pi_{st}$. We use this relationship
to define the characteristic equation of a GLC.

The three generator rays in a GLC correspond to the following 3D lines:

$$r_i = \lambda_i \cdot (s_i, t_i, 1) + (1 - \lambda_i) \cdot (u_i, v_i, 0), \quad i = 1, 2, 3$$

The three rays intersect some plane $\Pi_{z=\lambda}$ parallel to $\Pi_{uv}$ when
$\lambda_1 = \lambda_2 = \lambda_3 = \lambda$. By Lemma 3, all rays on the GLC pass through some line l
on $\Pi_{z=\lambda}$ if the three generator rays intersect l. Therefore, we only need to test
if there exists any $\lambda$ so that the three intersection points of the generator rays
with $\Pi_{z=\lambda}$ lie on a line. A necessary and sufficient condition for 3 points on a
constant-z plane to be collinear is that they have zero area on that plane. This
area is computed as follows (note: the constant factor of 2 relating the
determinant to the triangle area is unnecessary):

$$\begin{vmatrix}
\lambda \cdot s_1 + (1 - \lambda) \cdot u_1 & \lambda \cdot t_1 + (1 - \lambda) \cdot v_1 & 1 \\
\lambda \cdot s_2 + (1 - \lambda) \cdot u_2 & \lambda \cdot t_2 + (1 - \lambda) \cdot v_2 & 1 \\
\lambda \cdot s_3 + (1 - \lambda) \cdot u_3 & \lambda \cdot t_3 + (1 - \lambda) \cdot v_3 & 1
\end{vmatrix} = 0 \qquad (7)$$

Notice equation (7) is a quadratic equation in $\lambda$ of the form

$$A \cdot \lambda^2 + B \cdot \lambda + C = 0 \qquad (8)$$



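where

$$A = \begin{vmatrix} s_1 - u_1 & t_1 - v_1 & 1 \\ s_2 - u_2 & t_2 - v_2 & 1 \\ s_3 - u_3 & t_3 - v_3 & 1 \end{vmatrix}, \quad
B = \begin{vmatrix} s_1 & v_1 & 1 \\ s_2 & v_2 & 1 \\ s_3 & v_3 & 1 \end{vmatrix}
  - \begin{vmatrix} t_1 & u_1 & 1 \\ t_2 & u_2 & 1 \\ t_3 & u_3 & 1 \end{vmatrix}
  - 2 \cdot \begin{vmatrix} u_1 & v_1 & 1 \\ u_2 & v_2 & 1 \\ u_3 & v_3 & 1 \end{vmatrix}, \quad
C = \begin{vmatrix} u_1 & v_1 & 1 \\ u_2 & v_2 & 1 \\ u_3 & v_3 & 1 \end{vmatrix}$$
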

We call equation (8) the characteristic equation of a GLC. Since the characteristic
equation can be calculated from any three rays, one can also evaluate the
characteristic equation for EPI and pinhole cameras. The number of solutions of
the characteristic equation implies the number of lines that all rays on a GLC
pass through. It may have 0, 1, 2 or infinitely many solutions. The number of
solutions depends on the leading coefficient A and the quadratic discriminant
$\Delta = B^2 - 4AC$.

We note that the characteristic equation is invariant to translations in 4D.
Equivalently, translations of the two triangles formed by generator rays,
$(s'_i, t'_i) = (s_i + T_s, t_i + T_t)$, $(u'_i, v'_i) = (u_i + T_u, v_i + T_v)$, $i = 1, 2, 3$,
do not change the coefficients A, B and C of equation (8).
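
The coefficients of the characteristic equation can be evaluated directly from the three determinants. The following is a minimal sketch, assuming generator rays are given as $(s, t, u, v)$ tuples; the names `det3` and `characteristic_coefficients` are hypothetical, and the sample rays are assumed to pass through the point (0, 0, 2).

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three row tuples."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def characteristic_coefficients(rays):
    """Coefficients (A, B, C) of equation (8) for three rays (s, t, u, v)."""
    A = det3([(s - u, t - v, 1.0) for s, t, u, v in rays])
    C = det3([(u, v, 1.0) for s, t, u, v in rays])
    B = (det3([(s, v, 1.0) for s, t, u, v in rays])
         - det3([(t, u, 1.0) for s, t, u, v in rays])
         - 2.0 * C)
    return A, B, C


def translate(rays, ts, tt, tu, tv):
    """Shift the st triangle by (ts, tt) and the uv triangle by (tu, tv)."""
    return [(s + ts, t + tt, u + tu, v + tv) for s, t, u, v in rays]


# Three rays through (0, 0, 2): a pinhole configuration (illustrative values).
pinhole = [(0.0, 0.0, 0.0, 0.0), (0.5, 0.0, 1.0, 0.0), (0.0, 0.5, 0.0, 1.0)]
print(characteristic_coefficients(pinhole))
print(characteristic_coefficients(translate(pinhole, 3.0, -2.0, 1.0, 5.0)))
```

For the sample pinhole the equation is $0.25\lambda^2 - \lambda + 1 = 0$, whose double root $\lambda = 2$ is the depth of the assumed center of projection; the translated rays should give the same coefficients up to rounding.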

6 Characterizing Classic Camera Models

In this section, we show how to identify standard camera models using the cha- 
racteristic equation of 3 given generator rays. 

Lemma 5. Given a GLC, three generator rays, and its characteristic equation
$A \cdot \lambda^2 + B \cdot \lambda + C = 0$, then all rays are parallel to some plane if and only if
$A = 0$.

Proof. Notice in the matrix used to calculate A, row i is the direction $d_i$ of ray
$r_i$. Therefore A can be rewritten as $A = (d_1 \times d_2) \cdot d_3$. Hence $A = 0$ if and
only if $d_1$, $d_2$ and $d_3$ are parallel to some 3D plane. And by Lemma 2, all affine
combinations of these rays must also be parallel to that plane if $A = 0$. □



6.1 A = 0 Case 



When A = 0, the characteristic equation degenerates to a linear equation, which 
can have 1, 0, or an infinite number of solutions. By Lemma 5, all rays are 
parallel to some plane. Only three standard camera models satisfy this condition: 
pushbroom, orthographic, and EPI. 

All rays of a pushbroom lie on parallel planes and pass through one line, as 
is shown in Figure 4(a). A GLC is a pushbroom camera if and only if A = 0 and 
the characteristic equation has 1 solution. 

All rays of an orthographic camera have the same direction and do not all
simultaneously pass through any line l. Hence its characteristic equation has no
solution. The zero-solution criterion alone, however, is insufficient to determine
if a GLC is orthographic. We show in the following section that one can twist
an orthographic camera into bilinear sheets by rotating rays on parallel planes,
as is shown in Figure 4(b), and still maintain that all rays do not pass through
a common line. In Section 4, we showed that corresponding edges of the two
congruent triangles of an orthographic GLC must be parallel. This parallelism
is captured by the following expression:

$$\frac{s_i - s_j}{t_i - t_j} = \frac{u_i - u_j}{v_i - v_j}, \quad i, j = 1, 2, 3 \text{ and } i \neq j \qquad (9)$$



We call this condition the edge-parallel condition. It is easy to verify that a GLC 
is orthographic if and only if A = 0, its characteristic equation has no solution, 
and it satisfies the edge-parallel condition. 

Rays of an EPI camera all lie on a plane and pass through an infinite number
of lines on the plane. In order for a characteristic equation to have an infinite
number of solutions when $A = 0$, we must also have $B = 0$ and $C = 0$. This is not
surprising, because the intersection of the epipolar plane with $\Pi_{st}$ and $\Pi_{uv}$
must be two parallel lines, and it is easy to verify that $A = 0$, $B = 0$ and $C = 0$ if
and only if the corresponding GLC is an EPI.

6.2 A ≠ 0 Case

When $A \neq 0$, the characteristic equation becomes quadratic and can have 0, 1,
or 2 solutions, depending on the characteristic equation’s discriminant $\Delta$.
We show how to identify the remaining two classical cameras, pinhole and XSlit
cameras, in terms of A and $\Delta$.

All rays in a pinhole camera pass through the center of projection (COP).
Therefore, any three rays from a pinhole camera, if linearly independent, cannot
all be parallel to any plane, and by Lemma 5, $A \neq 0$. Notice that the roots of
the characteristic equation correspond to the depth of the line that all rays pass
through; hence the characteristic equation of a pinhole camera can only have one
solution that corresponds to the depth of the COP, even though there exists an
infinite number of lines passing through the COP. Therefore, the characteristic
equation of a pinhole camera must satisfy $A \neq 0$ and $\Delta = 0$. However, this
condition alone is insufficient to determine if a GLC is pinhole. In the following
section, we show that there exists a camera where all rays lie on a pencil of planes
sharing a line, as shown in Figure 4(c), which also satisfies these conditions. One
can, however, reuse the edge-parallel condition to verify if a GLC is pinhole.
Thus a GLC is pinhole if and only if $A \neq 0$, its characteristic equation has one
solution, and it satisfies the edge-parallel condition.

Rays of an XSlit camera pass through two slits and, therefore, the characteristic
equation of such a GLC must have at least two distinct solutions. Furthermore,
Pajdla [10] has shown all rays of an XSlit camera cannot pass through lines other
than its two slits; therefore, the characteristic equation of an XSlit camera has
exactly two distinct solutions. Thus, a GLC is an XSlit if and only if $A \neq 0$ and
$\Delta > 0$.

7 New Multiperspective Camera Models 

The characteristic equation also suggests three new multiperspective camera
types that have not been previously discussed. They include 1) twisted
orthographic: $A = 0$, the equation has no solution, and all rays do not have uniform
direction; 2) pencil camera: $A \neq 0$ and the equation has one root, but all rays
do not pass through a 3D point; 3) bilinear camera: $A \neq 0$ and the characteristic
equation has no solution. In this section, we give a geometric interpretation of
these three new camera models.

Before describing these camera models, however, we will first discuss a helpful
interpretation of the spatial relationships between the three generator rays. An
affine combination of two 4D points defines a 1-dimensional affine subspace.
Under 2PP, a 1-D affine subspace corresponds to a bilinear surface S in 3D
that contains the two rays associated with each 4D point. If these two rays
intersect or have the same direction in 3D space, S degenerates to a plane. Next,
we consider the relationship between ray $r_3$ and S. We define $r_3$ to be parallel
to S if and only if $r_3$ has the same direction as some ray $r \in S$. This definition
of parallelism is quite different from conventional definitions. In particular, if $r_3$
is parallel to S, $r_3$ can still intersect S. And if $r_3$ is not parallel to S, $r_3$ still
might not intersect S. Figure 3(b) and (c) show examples of each case.

General Linear Cameras

Fig. 3. Bilinear Surfaces. (a) $r_3$ is parallel to S; (b) $r_3$ is parallel to S, but still
intersects S; (c) $r_3$ is not parallel to S, and does not intersect S either.

This definition of parallelism, however, is closely related to A in the characteristic
equation. If $r_3$ is parallel to S, by definition, the direction of $r_3$ must be
some linear combination of the directions of $r_1$ and $r_2$, and, therefore, $A = 0$
by Lemma 5. $A = 0$, however, is not sufficient to guarantee $r_3$ is parallel to S.
For instance, one can pick two rays with uniform directions so that $A = 0$, yet
still have the freedom to pick a third so that it is not parallel to the plane, as is
shown in Figure 3(c).

The number of solutions to the characteristic equation is also closely related
to the number of intersections of $r_3$ with S. If $r_3$ intersects the bilinear surface
$S(r_1, r_2)$ at P, then there exists a line l, where $P \in l$, that all rays pass through.
This is because one can place a constant-z plane that passes through P and
intersects $r_1$ and $r_2$ at Q and R. It is easy to verify that P, Q and R lie on a
line and, therefore, all rays must pass through line PQR. Hence $r_3$ intersecting
$S(r_1, r_2)$ is a sufficient condition to ensure that all rays pass through some line.
It further implies that if the characteristic equation of a GLC has no solution, no two
rays in the camera intersect. GLCs whose characteristic equation has no solution
are examples of the oblique camera from [9].






Fig. 4. Pushbroom, Twisted Orthographic, and Pencil Cameras. (a) A pushbroom
camera collects rays on a set of parallel planes passing through a line; (b) A twisted 
orthographic camera collects rays with uniform directions on a set of parallel planes; 
(c) A pencil camera collects rays on a set of non-parallel planes that share a line. 

7.1 New Camera Models

Our GLC model and its characteristic equation suggests 3 new camera types 
that have not been previously described. 

Twisted Orthographic Camera: The characteristic equation of the twisted
orthographic camera satisfies $A = 0$, has no solution, and its generators do
not satisfy the edge-parallel condition. If $r_1$, $r_2$ and $r_3$ are linearly independent,
no solution implies $r_3$ will not intersect the bilinear surface S. In fact, no two
rays intersect in 3D space. In addition, $A = 0$ also implies that all rays are
parallel to some plane $\Pi$ in 3D space; therefore the rays on each of these parallel
planes must have uniform directions, as is shown in Figure 4(b). Therefore, a
twisted orthographic camera can be viewed as twisting parallel planes of rays in an
orthographic camera along common bilinear sheets.

Pencil Camera: The characteristic equation of a pencil camera satisfies
$A \neq 0$, has one solution and the generators do not satisfy the edge-parallel
condition. In Figure 4(c), we illustrate a sample pencil camera: rays lie on a
pencil of planes that share line l. In a pushbroom camera, all rays also pass
through a single line. However, pushbroom cameras collect rays along planes
transverse to l, whereas the planes of a pencil camera contain l (i.e., lie in the
pencil of planes through l), as is shown in Figure 4(a) and 4(c).

Bilinear Camera: By definition, the characteristic equation of a bilinear
camera satisfies $A \neq 0$ and the equation has no solution ($\Delta < 0$). Therefore,
similar to twisted orthographic cameras, no two rays intersect in 3D in a bilinear
camera. In addition, since $A \neq 0$, no two rays are parallel either. Therefore, any
two rays in a bilinear camera form a non-degenerate bilinear surface, as is shown
in Figure 3(a). The complete classification of cameras is listed in Table 1.



Table 1. Characterizing General Linear Cameras by the Characteristic Equation

Characteristic Equation | 2 Solutions | 1 Solution      | 0 Solutions     | Inf. Solutions
A ≠ 0                   | XSlit       | Pencil/Pinhole† | Bilinear        | ∅
A = 0                   | ∅           | Pushbroom       | Twisted/Ortho.† | EPI

†: A GLC satisfying the edge-parallel condition is pinhole (A ≠ 0) or orthographic (A = 0).
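
Table 1 can be read as a decision procedure. The sketch below is one possible implementation under stated assumptions, not the authors' code: rays are $(s, t, u, v)$ tuples, exact zero tests are replaced by a tolerance `eps` (an assumption for floating-point input), and the edge-parallel condition (9) is tested in cross-multiplied form to avoid division by zero. All function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three row tuples."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def characteristic_coefficients(rays):
    """Coefficients (A, B, C) of equation (8) for three rays (s, t, u, v)."""
    A = det3([(s - u, t - v, 1.0) for s, t, u, v in rays])
    C = det3([(u, v, 1.0) for s, t, u, v in rays])
    B = (det3([(s, v, 1.0) for s, t, u, v in rays])
         - det3([(t, u, 1.0) for s, t, u, v in rays]) - 2.0 * C)
    return A, B, C


def edge_parallel(rays, eps):
    """Equation (9), cross-multiplied: (si-sj)(vi-vj) == (ti-tj)(ui-uj)."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return all(
        abs((rays[i][0] - rays[j][0]) * (rays[i][3] - rays[j][3])
            - (rays[i][1] - rays[j][1]) * (rays[i][2] - rays[j][2])) < eps
        for i, j in pairs)


def classify_glc(rays, eps=1e-9):
    """Return the camera type from Table 1 for three generator rays."""
    A, B, C = characteristic_coefficients(rays)
    if abs(A) < eps:                       # linear (or degenerate) equation
        if abs(B) >= eps:
            return "pushbroom"             # exactly one solution
        if abs(C) < eps:
            return "EPI"                   # infinitely many solutions
        return "orthographic" if edge_parallel(rays, eps) else "twisted orthographic"
    delta = B * B - 4.0 * A * C            # quadratic discriminant
    if delta > eps:
        return "XSlit"
    if delta < -eps:
        return "bilinear"
    return "pinhole" if edge_parallel(rays, eps) else "pencil"


print(classify_glc([(0.0, 0.0, 0.0, 0.0), (0.5, 0.0, 1.0, 0.0), (0.0, 0.5, 0.0, 1.0)]))
```

The tolerance matters in practice: the one-solution and infinite-solution rows of Table 1 are measure-zero conditions, so noisy generator rays will almost always fall into the XSlit or bilinear cells unless `eps` is chosen with the data scale in mind.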



7.2 All General Linear Cameras 

Recall that the characteristic equation of a GLC is invariant to translation; 
we can therefore translate $(s_1, t_1)$ to $(0, 0)$ to simplify computation. Furthermore, 
we assume the uv triangle has canonical coordinates (0, 0), (1, 0) and (0, 1). This 
gives: 

$$A = s_2 t_3 - s_3 t_2 - s_2 - t_3 + 1, \qquad \Delta = (s_2 - t_3)^2 + 4 s_3 t_2 \tag{10}$$

The probability that $A = 0$ is very small; therefore pushbroom, orthographic and 
twisted orthographic cameras form a small subspace of GLCs. Furthermore, since 
$s_2$, $t_2$, $s_3$ and $t_3$ are independent variables, we can, by integration, determine that 
approximately two thirds of all possible GLCs are XSlit, one third are bilinear 
cameras, and the remainder are other types. 
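The proportions quoted above can be checked numerically. The following is a minimal Monte Carlo sketch, not the integration used in the paper: it assumes that the four parameters are drawn independently and uniformly from a symmetric interval (the text does not state the distribution), and the function names `glc_coefficients` and `estimate_fractions` are hypothetical. A sample is counted as XSlit when $\Delta > 0$ and as bilinear when $\Delta < 0$, following Table 1 for $A \neq 0$.

```python
import random


def glc_coefficients(s2, t2, s3, t3):
    """Return (A, Delta) of equation (10) for the canonical parameterization."""
    a = s2 * t3 - s3 * t2 - s2 - t3 + 1.0
    delta = (s2 - t3) ** 2 + 4.0 * s3 * t2
    return a, delta


def estimate_fractions(n_samples=200_000, seed=0, half_range=1.0):
    """Estimate the share of XSlit and bilinear GLCs by random sampling.

    Assumption (not from the paper): parameters are independent and
    uniform on [-half_range, half_range].
    """
    rng = random.Random(seed)
    counts = {"xslit": 0, "bilinear": 0, "other": 0}
    for _ in range(n_samples):
        s2, t2, s3, t3 = (rng.uniform(-half_range, half_range) for _ in range(4))
        a, delta = glc_coefficients(s2, t2, s3, t3)
        if a == 0.0 or delta == 0.0:
            counts["other"] += 1      # measure-zero cases (pencil, pushbroom, ...)
        elif delta > 0.0:
            counts["xslit"] += 1      # A != 0, two solutions
        else:
            counts["bilinear"] += 1   # A != 0, no solution
    return {name: count / n_samples for name, count in counts.items()}


if __name__ == "__main__":
    print(estimate_fractions())
```

Under this uniform assumption the XSlit share comes out close to two thirds and the bilinear share close to one third; because $\Delta$ is homogeneous in the parameters, the result does not depend on `half_range`.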



8 Example GLC Images 

In Figure 5, we compare GLC images of a synthetic scene. The distortions of the 
curved isolines on the objects illustrate various multi-perspective effects of GLC 
cameras. In Figure 6, we illustrate GLC images from a 4D light field. Each GLC 
is specified by three generator rays shown in red. By appropriately transforming 
the rays on the image plane via a 2D homography, most GLCs generate easily 
interpretable images. In Figure 7, we choose three desired rays from different 
pinhole images and fuse them into a multiperspective bilinear GLC image. 




Fig. 5. Comparison between synthetic GLC images. From left to right, top row: a 
pinhole, an orthographic and an EPI; middle row: a pushbroom, a pencil and a twisted 
orthographic; bottom row: a bilinear and an XSlit. 



9 Conclusions 

We have presented a General Linear Camera (GLC) model that unifies perspective 
(pinhole), orthographic and many multiperspective (including pushbroom 
and two-slit) cameras, as well as Epipolar Plane Images (EPI). We have also 
introduced three new linear multiperspective cameras that have not been previously 
explored: they are twisted orthographic, pencil and bilinear cameras. We 
have further deduced the characteristic equation for every GLC from its three 
generator rays and have shown how to use it to classify GLCs into eight canonical 
camera models. 

Fig. 6. GLC images created from a light field. Top row: a pencil, bilinear, and pushbroom 
image. Bottom row: an XSlit, twisted orthographic, and orthographic image. 

Fig. 7. A multiperspective bilinear GLC image synthesized from three pinhole cameras 
shown on the right. The generator rays are highlighted in red. 

The GLC model also provides an intuitive physical interpretation between 
lines, planar surfaces and bilinear surfaces in 3D space, and can be used to 
characterize real imaging systems like mirror reflections on curved surfaces. Since 
GLCs describe all possible 2D affine subspaces in 4D ray space, they can be used 
as a tool for first-order differential analysis of these high-order multiperspective 
imaging systems. GLC images can be rendered directly by ray tracing a synthetic 
scene, or by cutting through pre-captured light fields. By appropriately 
organizing rays, all eight canonical GLCs generate interpretable images similar 
to pinhole and orthographic cameras. Furthermore, we have shown that one can fuse 
desirable features from different perspectives to form any desired multiperspective 
image. 
Abstract. Our goal is to match contour lines between images and to recover 
structure and motion from them. The main difficulty is that pairs 
of lines from two images do not induce a direct geometric constraint on 
camera motion. Previous work uses geometric attributes (orientation, 
length, etc.) of single lines or groups of lines. Our approach is based on 
Pencils-of-Points (points on a line), or POPs for short. There are many 
advantages to using POPs for structure-from-motion. The most important 
one is that, contrary to pairs of lines, pairs of POPs may constrain 
camera motion. We give a complete theoretical and practical framework 
for automatic structure-from-motion using POPs: detection, matching, 
robust motion estimation, triangulation and bundle adjustment. For wide 
baseline matching, it has been shown that cross-correlation scores computed 
on patches neighbouring the lines give reliable results, given 
2D homographic transformations to compensate for the pose of the patches. 
When cameras are known, this transformation has a 1-dimensional 
ambiguity. We show that when cameras are unknown, using POPs leads 
to a 3-dimensional ambiguity, from which it is still possible to reliably 
compute cross-correlation. We propose linear and non-linear algorithms 
for estimating the fundamental matrix and for the multiple-view triangulation 
of POPs. Experimental results are provided for simulated and 
real data. 



1 Introduction 

Recovering structure and motion from images is one of the key goals in computer 
vision. A common approach is to detect and match image features while recovering 
camera motion. The goal of this paper is the automatic matching of lines 
and recovery of structure and motion. This problem is difficult because 
a pair of corresponding lines does not give a direct geometric constraint on 
the camera motion. Hence, one has to work on a three-view basis or assume that 
camera motion is known a priori, e.g. [10]. 

In this paper, we directly attack the two-view case by introducing a type of 
image primitive that we call Pencil-of-Points, or POP for short. A POP is made 
of a supporting line and a set of supporting points lying on the supporting line. 
Physically, a POP corresponds to a set of interest points on a contour line. POPs 
can be built on top of most contour lines. Contrary to pairs of corresponding 
lines, pairs of corresponding POPs may give geometric constraints on camera 
motion, provided that what we call the local geometry, relating corresponding 
points along the supporting lines, has been computed. We exploit these geometric 
constraints for matching POPs and recovering structure and motion. Once camera 
motion has been recovered using POPs, it can be employed for reliable guided 
matching and reconstruction of other types of features. 

The closest work to ours is [10]. The main difference is that the authors consider 
that the cameras are known and propose a wide-baseline guided-matching 
algorithm for lines. They show that reliable results are obtained based on 
cross-correlation scores, computed by warping the neighbouring textures of the lines 
using the 2D homography $H(\mu) \sim [l']_\times F + \mu e' l^T$, where $l \leftrightarrow l'$ are corresponding 
lines, $F$ is the fundamental matrix and $e'$ the second epipole. The projective 
parameter $\mu$ is computed by minimizing the cross-correlation score. 

Before going into further details about our approach, we underline some of 
the advantages of using POPs for automatic structure and motion recovery. First, 
a POP has fewer degrees of freedom than the supporting line and the individual 
supporting points, which implies that (i) its localization is often more accurate 
than that of the individual features, (ii) finding POPs in a set of interest points 
and contour lines increases their individual repeatability rate and (iii) structure 
and motion parameters estimated from POPs are more accurate than those recovered 
from points and/or lines. Second, matching or tracking POPs through 
images is more reliable than for individual contour lines or interest points, since a 
pair of corresponding POPs defines a local geometry, used to score matching hypotheses 
based on geometric or photometric criteria. Third, the robust estimation 
of camera motion based on random sampling from putative correspondences, i.e. 
in a RANSAC-like manner [3], is more efficient using POPs than other standard 
features, since only three pairs of POPs define a fundamental matrix, versus seven 
pairs of points. 



Contributions and paper organization. Using POPs for structure-from-motion is a 
new concept. We propose a comprehensive framework for multiple-view matching 
and recovery of structure and motion. Our framework is based on the following 
traditional steps, which also give the organization of this paper. 

First, §2, we investigate the detection of POPs in images and their matching. 
We define and study the local geometry of a pair of POPs. We propose methods 
for its estimation, which allow us to obtain putative POP correspondences, from 
which the epipolar geometry can be robustly estimated. 

Second, §3, we propose techniques for estimating the epipolar geometry from 
POP correspondences. Minimal and redundant cases are studied. 

Third, §4, we tackle the problem of triangulating POPs from multiple images. 
We derive and approximate the optimal (in the Maximum Likelihood sense) 
solution by an algorithm based on the triangulation of the supporting line, then 
the supporting points. 

Finally, bundle adjustment is described in §5. We provide experimental results 
on simulated data and give our conclusions and further work in §§6 and 
7 respectively. Experimental results on real data are provided throughout the 
paper. The following two paragraphs give our notation, some preliminaries and 
definitions. 

Notation and preliminaries. We make no formal distinction between coordinate 
vectors and physical entities. Equality up to a non-null scale factor is denoted by 
$\sim$. Vectors are typeset using bold font ($q$, $Q$), matrices using sans-serif fonts ($F$, 
$H$) and scalars in italic ($a$). Transposition and transposed inverse are denoted 
by $^T$ and $^{-T}$. The $(3 \times 3)$ skew-symmetric cross-product matrix is written as 
in $[q]_\times x = q \times x$. Indices are used to indicate the size of a matrix or vector 
($F_{(3 \times 3)}$, $q_{(3 \times 1)}$), to index a set of entities ($q_i$) or to select coefficients of matrices 
or vectors (e.g. $q_i$). Index $i$ is used for the $n$ images, $j$ for the $m$ features and 
$k$ for the $p$ supporting points of a POP¹. The supporting lines are written $l_{ij}$ 
(the supporting line of the $j$-th POP in image $i$) and supporting points as $q_{ijk}$ 
(the $k$-th supporting point of the $j$-th POP in image $i$). Indices are sometimes 
dropped for clarity. The identity matrix is written $I$ and the null-vector as $0$. We 
use the Euclidean distance between points, denoted $d_e$, and an algebraic distance $d_a$ 
defined by: 

$$d_a^2(q, u) = \|S [q]_\times u\|^2 \quad \text{with} \quad S = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}. \tag{1}$$
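As an illustration of the algebraic distance of equation (1), the sketch below evaluates it for homogeneous 3-vectors, assuming that $S$ selects the first two rows of $[q]_\times u$. The helper names `cross` and `algebraic_distance_sq` are hypothetical and not taken from the paper.

```python
def cross(a, b):
    """Cross product of two 3-vectors, i.e. [a]x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def algebraic_distance_sq(q, u):
    """Squared algebraic distance ||S [q]x u||^2 of equation (1).

    Assumption: S keeps the first two coordinates of the cross product.
    """
    c = cross(q, u)
    return c[0] ** 2 + c[1] ** 2


if __name__ == "__main__":
    # Both points have unit third coordinate, so the value equals the
    # squared Euclidean distance between (0, 0) and (3, 4).
    print(algebraic_distance_sq((0.0, 0.0, 1.0), (3.0, 4.0, 1.0)))
```

When both vectors are scaled to have a unit third coordinate, this quantity coincides with the squared Euclidean distance; for other scalings it is only proportional to it, which is why it is called algebraic.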

Definitions. A pencil of points is a set of $p$ supporting points lying on a supporting 
line. If $p \geq 3$, the POP is said to be complete; otherwise, it is said to be 
incomplete. A complete correspondence is a correspondence of complete POPs. 
As shown in the next section, only complete correspondences may define a local 
geometry. 

We distinguish two kinds of correspondences of POPs: line-level and point-level 
correspondences. A line-level correspondence means that only the supporting 
lines are known to match. A point-level correspondence is stronger and means 
that a point-to-point mapping along the supporting lines has been established. 



2 Detecting and Matching Pencil-of-Points 

2.1 Detecting 

Detecting POPs in images is the first step of the structure-from-motion process. 
One of the most important properties of a detector is its ability to achieve 
repeatability rates² as high as possible, which reflects the fact that it can detect 
the same features in different images. In order to ensure high repeatability rates, 
we formulate our POP detector based on interest points and contour lines, for 
which there exist detectors achieving high repeatability rates, see [9] for interest 
points and [2] for contour lines. 

¹ To simplify the notation, we assume without loss of generality that all POPs have 
the same number of supporting points. 

² The repeatability rate between two images is the number of corresponding features 
over the mean number of detected points [9]. 

A Framework for Pencil-of-Points Structure-from-Motion 31 

In order to detect salient POPs, we merge nearby contour lines. Algorithms 
based on the Hough transform or RANSAC [3] can be used to detect POPs within 
a set of points and/or lines. We propose the following simple solution. First, an 
empty POP is instantiated for each line (which gives the supporting line). Second, 
each point is attached to the POPs whose supporting line is at a distance lower 
than a threshold, which we typically choose as a few pixels. Finally, incomplete 
POPs, i.e. those for which the number of supporting points is less than three, are 
eliminated. Note that we use a loose threshold for interest point and contour line 
detection, to get as many POPs as possible. The less significant interest points 
and contour lines are generally pruned as they are respectively not attached to 
any POP or form incomplete POPs. An example of POP detection is shown in 
Figures 1(a) and 1(b). It is observed that the repeatability rate of POPs is higher 
than each of the repeatability rates of points and lines. 
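The three detection steps described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: contour lines are treated as infinite lines given by coefficients $(a, b, c)$ of $ax + by + c = 0$ (segment extents are ignored), and the names `detect_pops`, `threshold` and `min_points` are assumptions introduced here.

```python
import math


def point_line_distance(point, line):
    """Euclidean distance from (x, y) to the line a*x + b*y + c = 0."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)


def detect_pops(lines, points, threshold=2.0, min_points=3):
    """Attach interest points to nearby lines and keep complete POPs only.

    Returns a list of (supporting_line, supporting_points) pairs.
    """
    pops = []
    for line in lines:
        # Steps 1 and 2: instantiate a POP for the line, attach close points.
        support = [p for p in points if point_line_distance(p, line) < threshold]
        # Step 3: discard incomplete POPs (fewer than three supporting points).
        if len(support) >= min_points:
            pops.append((line, support))
    return pops


if __name__ == "__main__":
    lines = [(0.0, 1.0, 0.0), (1.0, 0.0, -100.0)]   # y = 0 and x = 100
    points = [(0.0, 0.5), (10.0, -1.0), (20.0, 1.5), (100.0, 50.0), (50.0, 40.0)]
    for line, support in detect_pops(lines, points):
        print(line, support)
```

In this toy input only the line $y = 0$ gathers three supporting points, so the second line is pruned as an incomplete POP.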




(a) (b) (c) (d) 



Fig. 1. (a) & (b) show the detected POPs. The repeatability rate is 51%, while for 
points and lines it is lower, respectively 41% and 37%. (c) & (d) show the 9 putative 
matches obtained with our algorithm. In this example, all of them are correct, which 
shows the robustness of our local geometry based cross-correlation measure. 



2.2 Matching 

Traditional structure-from-motion algorithms using interest points usually rely 
on an initial matching, followed by the robust estimation of camera geometry and 
a guided-matching step, see e.g. [6]. The initial matching step is often based on 
similarity measures between points such as correlation or grey-value invariants. 
Guided-matching uses the estimated camera geometry to constrain the search 
area. In the case of POPs, the initial matching step is based on the local geometry 
defined by a pair of POPs. This step is described below, followed by the robust 
estimation of the epipolar geometry. 



Matching Based on Local Geometry. As mentioned above, the idea is to use 
the local geometry defined by a pair of POPs. We show that this local geometry 
is modeled by a 1D homography and allows us to establish dense correspondences 
between the two supporting lines. Given a hypothesized line-level POP correspondence, 
we upgrade it to point-level by computing its local geometry. Given 
a point-level correspondence, a similarity score can be computed using 
cross-correlation, in a manner similar to [10]. For each POP in one image, the score 
is computed for all POPs in the other image and a ‘winner takes all’ scheme is 
employed to extract a set of putative POP matches. Putative matches obtained 
by our algorithm are shown in Figures 1(c) and 1(d). 



Defining and computing the local geometry. We study the local geometry induced 
by a point-level correspondence, and propose an estimation method. 

Proposition 1. Corresponding supporting points are linked by a 1D homography, 
related to the epipolar transformation, relating corresponding epipolar lines. 

Proof: Corresponding supporting points lie on corresponding epipolar lines: there 
is a trivial one-to-one correspondence between supporting points and epipolar 
lines (provided the supporting lines do not contain the epipoles). The proof 
follows from the fact that the epipolar pencils are related by a 1D homography 
[12]. ■ 

First, we shall define a local $P^1$ parameterization of the supporting points, 
using two Euclidean transformation matrices $A$ and $A'$ acting such that the 
supporting lines are rotated to be vertical and aligned with the y-axes of the 
images. The transformed supporting points are $x_k \sim A q_k \sim (0\ y_k\ 1)^T$ and 
$x'_k \sim A' q'_k \sim (0\ y'_k\ 1)^T$. Second, we introduce a 1D homography $g$ as: 

$$\begin{pmatrix} y'_k \\ 1 \end{pmatrix} \sim g \begin{pmatrix} y_k \\ 1 \end{pmatrix}, \tag{2}$$

which is equivalent to $x' \sim G(\mu) x$ with 

$$G(\mu) \sim \left(\begin{array}{c|c} \mu_1 & 0^T \\ \hline \begin{matrix} \mu_2 \\ \mu_3 \end{matrix} & g \end{array}\right),$$

where the 3-vector $\mu^T \sim (\mu_1\ \mu_2\ \mu_3)$ represents projective parameters which are significant only 
when $G(\mu)$ is applied to points off the supporting line. The 2D homography 
mapping corresponding points along the supporting lines is $H(\mu) \sim A'^{-1} G(\mu) A$. 

The 1D homography $g$ can be estimated from $p \geq 3$ pairs of supporting 
points using equation (2). This is the reason why complete POPs are defined as 
those which have at least 3 supporting points. Given $g$, $H(\mu)$ can be formed. 
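To illustrate why three pairs of supporting points suffice, the sketch below estimates a 1D homography from exactly three correspondences $y_k \leftrightarrow y'_k$ along the aligned supporting lines. It is a minimal example under assumptions, not the estimator of the paper: $g$ is represented by four coefficients $(a, b, c, d)$ with $y' = (ay + b)/(cy + d)$, and the null vector of the resulting $3 \times 4$ linear system is obtained from its signed $3 \times 3$ minors. The function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def homography_1d(ys, ys_prime):
    """Estimate (a, b, c, d), up to scale, from three pairs y <-> y'.

    Each pair gives one linear constraint a*y + b - y'*c*y - y'*d = 0.
    """
    rows = [(y, 1.0, -yp * y, -yp) for y, yp in zip(ys, ys_prime)]
    g = []
    for col in range(4):
        minor = [[row[c] for c in range(4) if c != col] for row in rows]
        g.append((-1) ** col * det3(minor))   # signed minor = null-vector entry
    return tuple(g)


def apply_1d(g, y):
    """Map a coordinate y along the first line to the second line."""
    a, b, c, d = g
    return (a * y + b) / (c * y + d)


if __name__ == "__main__":
    # Synthetic pairs generated by y' = (2y + 1) / (0.5y + 3).
    ys = [0.0, 1.0, 2.0]
    ys_prime = [(2 * y + 1) / (0.5 * y + 3) for y in ys]
    g = homography_1d(ys, ys_prime)
    print(apply_1d(g, 5.0))
```

With more than three pairs a least-squares solution of the same linear constraints would be used instead; that variant needs a singular value decomposition and is omitted here.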



Computing $H(\mu)$. The above-described algorithm cannot be applied directly 
since at this stage, we only have line-level POP correspondence hypotheses. We 
have to upgrade them to point-level to estimate $H(\mu)$ with the previously-given 
algorithm and score them by computing cross-correlation. We propose the following 
algorithm: 

- for all valid pairs of triplets of supporting points³: 

• compute the local geometry represented by $H(\mu)$. 

• compute the cross-correlation score based on $H(\mu)$, see below. 

- return the $H(\mu)$ corresponding to the highest cross-correlation score. 

³ Valid triplets satisfy an ordering constraint, namely middle points have to match. 
Computing cross-correlation. For a pair of POPs, the matching score is obtained 
by evaluating the cross-correlation using $H(\mu)$ to associate corresponding points. 
The cross-correlation is evaluated within rectangular strips centered on the 
supporting lines. The length of the strips is given by the overlap of the supporting 
lines in each image. The width of the strips must be sufficiently large for 
cross-correlation to be discriminative. During our experiments, we found that 
a width of 3 to 7 pixels was appropriate. For pixels off the supporting lines, 
the $\mu$ parameters are significant. The following solutions are possible: compute 
these parameters by minimizing the cross-correlation score, as in [10], or use the 
median luminance and chrominance of the regions adjacent to the supporting 
lines [1]. The first solution is computationally too expensive to be used in our 
inner loop, since 3 parameters have to be estimated, while the second solution 
is not discriminative enough. We propose to map pixels along lines perpendicular 
to the supporting lines. Hence, the method uses neighbouring texture while 
being independent of $\mu$. In order to take into account a possible non-planarity 
surrounding the supporting lines, we weight the contribution of each pixel to 
cross-correlation proportionally to the inverse of its distance to the supporting 
line. 

Robustly Computing the Epipolar Geometry. At this stage, we are given 
a set of putative POP correspondences. We employ a robust estimator, which allows 
us to estimate the epipolar geometry and to discriminate between inliers and 
outliers. We use a scheme based on RANSAC [3], which maximizes the number of 
inliers. In order to use RANSAC, one must provide a minimal estimator, i.e. an 
estimator which computes the epipolar geometry from the minimum number of 
correspondences, and a function to discriminate between inliers and outliers, given 
a hypothesized epipolar geometry. The number of trials required to ensure 
a good probability of success, say 0.99, depends on the minimal number of 
correspondences needed to compute the epipolar geometry. Our minimal estimator 
described in §3 needs 3 pairs of POPs. Applying a RANSAC procedure is therefore 
much more efficient with POPs than with points: with 50% of outliers, 35 trials 
are sufficient with POPs, while 588 trials are required for points (values taken 
from [6]). 
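The trial counts quoted above are consistent with the usual RANSAC sample-count formula $N = \log(1 - p) / \log(1 - (1 - \epsilon)^s)$, where $p$ is the desired probability of success, $\epsilon$ the outlier ratio and $s$ the minimal sample size. The short sketch below evaluates it; the function name `ransac_trials` is hypothetical.

```python
import math


def ransac_trials(sample_size, outlier_ratio, success_prob=0.99):
    """Number of random samples needed to draw, with probability
    success_prob, at least one sample made of inliers only."""
    all_inliers = (1.0 - outlier_ratio) ** sample_size
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - all_inliers))


if __name__ == "__main__":
    print(ransac_trials(3, 0.5))   # three pairs of POPs
    print(ransac_trials(7, 0.5))   # seven pairs of points
```

With 50% of outliers this gives 35 trials for samples of size 3 and 588 trials for samples of size 7, matching the figures in the text.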

Our inlier/outlier discriminating function is based on computing the 
cross-correlation score using [10]. Inliers are selected by thresholding this score. We use 
a threshold of a few percent (2% to 5%) of the maximal grey value. Figures 2(a)-(d) 
show an example of epipolar geometry computation, and the set of corresponding 
POPs obtained after guided-matching based on the method of [10]. 

3 Computing the Epipolar Geometry 

Proposition 2. The minimal number of pairs of POPs in general position⁴ needed 
to define a unique fundamental matrix is 3. 

Proof: Due to lack of space, this proof is left for an extended version of the paper. 

⁴ General position means that the supporting lines are not coplanar and do not lie on 
an epipolar plane, i.e. the image lines do not contain the epipoles. 
(a) (b) (c) (d) 

Fig. 2. (a) & (b) show a representative set of corresponding epipolar lines while (c) & 
(d) show the 11 matched lines obtained after guided-matching using the algorithm of 
[10]. 



3.1 The ‘Eight Corrected Point’ Algorithm 

This linear estimator is based on the constraints induced by the supporting 
points. Pairs of supporting points $q_{jk} \leftrightarrow q'_{jk}$ are obtained based on the previously 
estimated local geometries $H(\mu)$. The first idea that comes to mind is 
to use the supporting points as input to the eight point algorithm [7]. This 
algorithm minimizes an algebraic distance between predicted epipolar lines and 
observed points. The eight corrected point algorithm consists in correcting the 
position of the supporting points, i.e. making them collinear, prior to applying 
the eight point algorithm. Using this procedure reduces the noise on the point 
positions, as we shall verify experimentally. 

3.2 The ‘Three Pop’ Algorithm 

This linear algorithm compares observed points and predicted points. This 
algorithm is more statistically meaningful than the eight point algorithm, in the 
case of POPs, in that observed and predicted features are directly compared. 

We wish to predict the supporting point positions. We intersect the predicted 
epipolar lines, i.e. $F q_{jk}$ in the second image, with the supporting lines $l'_j$: 
the predicted point is given by $[l'_j]_\times F q_{jk}$. Our cost function is given by summing 
the squared algebraic distances between observed and predicted points: 
$d_a^2(q'_{jk}, [l'_j]_\times F q_{jk})$. In order to obtain a symmetric criterion, we consider predicted 
and observed points in the first image also, which yields: 

$$C_a = \sum_j \sum_k \left( d_a^2(q_{jk}, [l_j]_\times F^T q'_{jk}) + d_a^2(q'_{jk}, [l'_j]_\times F q_{jk}) \right). \tag{3}$$

After introducing explicitly $d_a$ from equation (1) and minor algebraic manipulations, 
we obtain the matrix form $C_a = \sum_j \sum_k \left( \|B_{jk} f\|^2 + \|B'_{jk} f\|^2 \right)$, where 
$f = \mathrm{vect}(F)$ is the row-wise vectorization of $F$ and: 

$$B_{jk} = S [q_{jk}]_\times [l_j]_\times \left( q'_{jk,1} I_3 \;\; q'_{jk,2} I_3 \;\; q'_{jk,3} I_3 \right), \qquad B'_{jk} = S [q'_{jk}]_\times [l'_j]_\times \,\mathrm{diag}\left( q_{jk}^T, q_{jk}^T, q_{jk}^T \right).$$

The cost function becomes $C_a = \|B f\|^2$ with $B^T \sim \left( B_{11}^T \;\; B'^T_{11} \;\; \ldots \;\; B_{mp}^T \;\; B'^T_{mp} \right)$. 

The singular vector associated with the smallest singular value of $B$ gives the $f$ that 
minimizes $C_a$. Similarly to the eight point algorithm, the obtained fundamental 
matrix does not satisfy the rank-deficiency constraint in general, and has to be 
corrected by nullifying its smallest singular value, see e.g. [6]. 
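The prediction step used by the three POP cost function can be illustrated in a few lines: the predicted supporting point in the second image is the intersection of the epipolar line $F q$ with the supporting line $l'$, i.e. the cross product of the two lines. The sketch below only checks this geometric fact on arbitrary numbers; the matrix used for `F` is a placeholder rather than a valid fundamental matrix, and all names are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors, i.e. [a]x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m, v):
    """Product of a 3x3 matrix (nested sequences) with a 3-vector."""
    return tuple(dot(row, v) for row in m)


def predict_point(F, q, l_prime):
    """Predicted supporting point [l']x F q in the second image."""
    epipolar_line = mat_vec(F, q)
    return cross(l_prime, epipolar_line)


if __name__ == "__main__":
    F = ((0, -1, 2), (1, 0, -3), (-2, 3, 0))   # placeholder 3x3 matrix
    q = (4, 5, 1)
    l_prime = (1, -2, 3)
    p = predict_point(F, q, l_prime)
    # The predicted point lies on both the supporting and the epipolar line.
    print(dot(p, l_prime), dot(p, mat_vec(F, q)))
```

The algebraic residual of equation (3) is then obtained by comparing this predicted point with the observed supporting point using the distance of equation (1).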



3.3 Non-linear ‘Reduced’ Estimation 

The previously-described three POP estimator is statistically sound in the sense 
that observed and predicted points are compared in the linear cost function (3). 
However, the comparison is done using the algebraic distance $d_a$. This is the 
price to pay to get a linear estimator. In this section, we consider a cost function 
with a similar form, but using the Euclidean distance $d_e$ to compare observed 
and predicted points: 

$$C_e = \sum_j \sum_k \left( d_e^2(q_{jk}, [l_j]_\times F^T q'_{jk}) + d_e^2(q'_{jk}, [l'_j]_\times F q_{jk}) \right). \tag{4}$$

We use the Levenberg-Marquardt algorithm, see e.g. [6], with a suitable 
parameterization of the fundamental matrix [12] to minimize this cost function, based 
on the initial solution provided by the three POP algorithm. 



4 Multiple-View Triangulation 

We deal with the triangulation of POPs seen in multiple views. Note that since 
the triangulation of a POP is independent from the others, we drop the index $j$ 
in this section. 



4.1 Optimal Triangulation 

The optimal 3D POP is the one which best explains the data, i.e. which minimizes 
the sum of squared Euclidean distances between predicted and observed 
supporting points. Assuming that 3D POPs are represented by two points 
$M$ and $N$ for the supporting line and $p$ scalars $\alpha_k$ for the supporting points 
$Q_k \sim \alpha_k M + (1 - \alpha_k) N$, the following non-linear problem is obtained: 

$$\min_{M, N, \ldots, \alpha_k, \ldots} C_{pop} \quad \text{with} \quad C_{pop} = \sum_{i=1}^{n} \sum_{k=1}^{p} d_e^2\big(P_i(\alpha_k M + (1 - \alpha_k) N),\, q_{ik}\big). \tag{5}$$

We use the Levenberg-Marquardt algorithm, e.g. [6]. We examine the difficult 
problem of finding a reliable initial solution in the next section. 
4.2 Initialization 

Finding an initial solution which is close to the optimal one is of primary im- 
portance. The initialization method must minimize a cost function as close as 
possible to (5). We propose a two-step initialization algorithm consisting in tri- 
angulating the supporting line, then each supporting point. Our motivations for 
these steps are explained while reviewing line triangulation below. 

Line Triangulation. Line triangulation from multiple views is a standard 
structure-from-motion problem and has been widely studied, see e.g. [5]. The 
optimal line $\langle M, N \rangle$ is given by minimizing the sum of squared Euclidean 
distances between the predicted lines $(P_i M) \times (P_i N)$ and the observed points $q_{ik}$ 
as $\min_{M,N} \sum_{i=1}^{n} \sum_{k=1}^{p} d_e^2\big((P_i M) \times (P_i N),\, q_{ik}\big)$. To make the relationship with 
the cost function (5) appear, we introduce a set of points $Q_{ik}$ on the 3D line. 
Using the fact that the Euclidean distance between a point and a line is equal 
to the Euclidean distance between the point and the projection of this point on 
the line, we rewrite the line triangulation problem as: 

$$\min_{M, N, \ldots, \alpha_{ik}, \ldots} C_{line} \quad \text{with} \quad C_{line} = \sum_{i=1}^{n} \sum_{k=1}^{p} d_e^2\big(P_i(\alpha_{ik} M + (1 - \alpha_{ik}) N),\, q_{ik}\big). \tag{6}$$

Compare this cost function with (5): the difference is that for line triangulation, the 
points are not supposed to match between the different views. Hence, a 3D point 
on the line is reconstructed for each image point, while in the POP triangulation 
problem, a 3D point on the line is reconstructed for each image point correspondence. 
Now, the interesting point is to determine if, in practice, cost functions 
(5) and (6) yield close solutions for the reconstructed 3D line. Obviously, an 
experimental study is necessary, and we refer to §6. However, we intuitively expect 
that the results are close. 



Point-on-Line Triangulation. We study the problem of point-on-line optimal 
triangulation: given a 3D line, represented by two 3D points $M$ and $N$, and a set of 
corresponding image points $\ldots, q_{ik}, \ldots$, find a 3D point $Q_k \sim \alpha_k M + (1 - \alpha_k) N$ 
on the given 3D line, such that the sum of squared Euclidean distances between the 
predicted and the observed points is minimized. 

For point-on-line triangulation, we formalise the problem as 
$\min_{\alpha_k} \sum_{i=1}^{n} d_e^2\big(P_i(\alpha_k M + (1 - \alpha_k) N),\, q_{ik}\big)$ and by introducing $b_i = P_i (M - N)$ 
and $d_i = P_i N$, we obtain: 

$$\min_{\alpha_k} C_{pol} \quad \text{with} \quad C_{pol} = \sum_{i=1}^{n} d_e^2(\alpha_k b_i + d_i,\, q_{ik}). \tag{7}$$



Sub-optimal linear algorithm. We give a linear algorithm, based on approximating 
the optimal cost function (7) by replacing the Euclidean distance $d_e$ by the 
algebraic distance $d_a$. The algebraic cost function is 
$\sum_{i=1}^{n} d_a^2(\alpha_k b_i + d_i, q_{ik}) = \sum_{i=1}^{n} \|\alpha_k S [q_{ik}]_\times b_i + S [q_{ik}]_\times d_i\|^2$. 
A closed-form solution giving the best $\alpha_k$ in the least-squares sense is 

$$\alpha_k = - \frac{\sum_{i=1}^{n} b_i^T [q_{ik}]_\times I [q_{ik}]_\times d_i}{\sum_{i=1}^{n} b_i^T [q_{ik}]_\times I [q_{ik}]_\times b_i} \quad \text{with} \quad I \sim S^T S.$$
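The closed-form solution above can be sketched in a few lines of code. The version below is an illustrative implementation under stated assumptions, not the authors' code: it works with the 2-vectors $u_i = S [q_{ik}]_\times b_i$ and $v_i = S [q_{ik}]_\times d_i$ and returns $\alpha_k = -\sum_i u_i^T v_i / \sum_i u_i^T u_i$, which is the same ratio as in the formula since the sign introduced by $[q]_\times^T = -[q]_\times$ cancels between numerator and denominator. Cameras are $3 \times 4$ nested tuples, and the names `project` and `point_on_line_alpha` are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors, i.e. [a]x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def project(P, X):
    """Apply a 3x4 camera matrix to a homogeneous 4-vector."""
    return tuple(sum(P[r][c] * X[c] for c in range(4)) for r in range(3))


def point_on_line_alpha(cameras, M, N, observations):
    """Linear estimate of alpha such that Q ~ alpha*M + (1 - alpha)*N
    best explains the observed image points (algebraic distance)."""
    diff = tuple(m - n for m, n in zip(M, N))
    num = 0.0
    den = 0.0
    for P, q in zip(cameras, observations):
        b = project(P, diff)        # b_i = P_i (M - N)
        d = project(P, N)           # d_i = P_i N
        u = cross(q, b)[:2]         # S [q]x b_i
        v = cross(q, d)[:2]         # S [q]x d_i
        num += u[0] * v[0] + u[1] * v[1]
        den += u[0] * u[0] + u[1] * u[1]
    return -num / den


if __name__ == "__main__":
    P1 = ((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0))
    P2 = ((1, 0, 0, -1), (0, 1, 0, 0), (0, 0, 1, 0))
    M, N = (0.0, 0.0, 4.0, 1.0), (2.0, 1.0, 6.0, 1.0)
    alpha = 0.25
    Q = tuple(alpha * m + (1 - alpha) * n for m, n in zip(M, N))
    observations = [project(P1, Q), project(P2, Q)]   # noise-free image points
    print(point_on_line_alpha([P1, P2], M, N, observations))
```

On noise-free data the estimate coincides with the true parameter; with noisy observations it only minimizes the algebraic cost and serves as an initial guess for the optimal methods.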
Optimal polynomial algorithm. This algorithm consists in finding the roots of a 
degree-$(3n - 2)$ polynomial in the parameter $\alpha_k$, whose coefficients depend on 
the $b_i$, the $d_i$ and the $q_{ik}$. Due to lack of space, details are left to an extended 
version of the paper. 

5 Bundle Adjustment 

Bundle adjustment consists in minimizing the reprojection error over structure 
and motion parameters: 

$$\min_{P_1, \ldots, P_n, M_1, \ldots} \; \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{p} d_e^2\big(P_i(\alpha_{jk} M_j + (1 - \alpha_{jk}) N_j),\, q_{ijk}\big),$$

where we consider without loss of generality that all points are visible in all 
views. We use the Levenberg-Marquardt algorithm to minimize this cost function, 
starting from an initial solution obtained by matching pairs of images and 
computing pair-wise fundamental matrices using the algorithms of §§2 and 3, 
from which the multiple-view geometry is extracted as in [11]. Multiple-view 
matches are formed, and the POPs are triangulated using the optimal method 
described in §4. 



6 Experimental Results 

We simulate a set of 3D POPs observed by two cameras, with focal length 1000 
pixels. To simulate a realistic scenario, each POP is made of 5 supporting points. 
The supporting points are projected onto the images, and centered Gaussian 
noise is added. The images of the supporting lines are determined as the best 
fit to the noisy supporting points. These data are used to compare quasi-metric 
reconstructions of the scene, obtained using different algorithms. We measure 
the reprojection error and a 3D error, obtained as the minimum residual of 
$\min_{H_u} \sum_j d^2(\bar{Q}_j, H_u Q_j)$, where $\bar{Q}_j$ are the ground truth 3D points, $Q_j$ the 
reconstructed points and $H_u$ an aligning 3D homography. 

Comparing triangulation algorithms. The first two methods are based on 
triangulating the supporting line, then each supporting point using the linear 
solution (method ‘Line Triangulation + Lin’) or using the optimal polynomial 
solution (method ‘Line Triangulation + Poly’). The third method is 
Levenberg-Marquardt minimization of the reprojection error, for POPs (method ‘ML Pops’) 
or points (method ‘ML Points’). We observe in Figure 3(a) that triangulating 
the supporting line followed by the supporting points on this line (methods ‘Line 
Triangulation + *’) produces results close to the non-linear minimization of the 
reprojection error of the POP (method ‘ML Pops’). Minimizing 
the reprojection error individually for each point (method ‘ML Points’) 
produces lower reprojection errors. 

Concerning the 3D error, shown in Figure 3(b), we also observe that methods 
‘Line Triangulation + *’ produce results close to method ‘ML Pops’. However, 
we observe that method ‘ML Points’ gives results worse than all other methods. 
This is due to the fact that this method does not benefit from the structural 
constraints defining POPs. 

Fig. 3. Reprojection and 3D error when varying the added image noise variance to 
compare structure and motion recovery methods. 

Comparing bundle adjustment algorithms. The first two methods are based on 
computing the epipolar geometry using the eight point algorithm (method ‘Eight 
Point Alg.’) or the three POP algorithm (method ‘Three Pop Alg.’), then triangulating 
the POPs using the optimal triangulation method. The two other methods 
are bundle adjustment of POPs and points respectively. We observe in Figure 4(a) 
that the eight point algorithm yields the worst reprojection error, followed 
by the three POP algorithm and the eight corrected point algorithm. Bundle 
adjustment of POPs gives a reprojection error slightly higher than with points. 
However, Figure 4(b) shows that bundle adjustment of POPs gives a better 3D 
structure than points, due to the structural constraints. It also shows that the 
eight corrected point algorithm yields good results. 

Fig. 4. Reprojection and 3D error when varying the added image noise variance to 
compare triangulation methods. 

7 Conclusions and Further Work 

We addressed the problem of automatic structure and motion recovery from 
images containing lines. We introduced a feature that we call POP, for Pencil-of- 
Points. We demonstrated our matching algorithm on real images. This confirms 
that the repeatability rate of pops is higher than the repeatability rates of the 
points and lines from which they are detected. This also shows that using pops, 
wide baseline matching and the epipolar geometry can be successfully computed 
in an automatic manner, using simple cross-correlation. Experimental results on 
simulated data show that due to the strong structural constraints, POPs yield 
structure and motion estimates more accurate than with points. 

The advantages of using POPs are numerous. Briefly, localization, repeatability 
rate, and structure and motion estimates are better with POPs than with points, 
and robust estimation is very efficient since only three pairs of POPs define an 
epipolar geometry. For this reason, we believe that this new feature could become 
standard for automatic structure-and-motion in man-made environments, 
i.e. based on lines. 

Further work will consist in investigating the determination of parameters fi 
needed to compute undistorted cross-correlation, since we believe that it could 
strongly improve the initial matching step, and studying methods for estimating 
the trifocal tensor from triplets of pops. 

Abstract. Suppose that two perspective views of four world points are given, 
that the intrinsic parameters are known, but the camera poses and the world point 
positions are not. We prove that the epipole in each view is then constrained to lie on 
a curve of degree ten. We give the equation for the curve and establish many of the 
curve’s properties. For example, we show that the curve has four branches through 
each of the image points and that it has four additional points on each conic of the 
pencil of conics through the four image points. We show how to compute the four 
curve points on each conic in closed form. We show that orientation constraints 
allow only parts of the curve and find that there are impossible configurations 
of four corresponding point pairs. We give a novel algorithm that solves for the 
essential matrix given three corresponding points and one epipole. We then use 
the theory to describe a solution, using a 1-parameter search, to the notoriously 
difficult problem of solving for the pose of three views given four corresponding 
points. 



1 Introduction 

Solving for unknown camera locations and scene structure given multiple views of a 
scene has been a central task in computer vision for several decades and in photogram- 
metry for almost two centuries. If the intrinsic parameters (such as focal lengths) of the 
views are known a priori, the views are said to be calibrated. In the calibrated case, it 
is possible to determine the relative pose between two views up to ten solutions and an 
unknown scale given five corresponding points [8,1]. In the uncalibrated case, at least 
seven corresponding points are required to obtain up to three solutions for the fundamental 
matrix, which is the uncalibrated equivalent to relative pose [4]. We will characterise 
the solutions in terms of their epipoles, i.e. the image in one view of the perspective 
center of the other view. If we have one point correspondence less than the minimum 
required, we can expect to get a whole continuum of solutions. In the uncalibrated case, 
it is well known [7] that six point correspondences give rise to a cubic curve of possible 
epipoles. However, to the best of our knowledge, the case of four point correspondences 
between two calibrated views has not been studied previously. We will show that four 
point correspondences between two calibrated views constrain the epipole in each image 
to lie on a decic (i.e. tenth degree) curve. Moreover, if we disregard orientation constraints, 
each point on the decic curve is a possible epipole. The decic curve varies with 
the configuration of the points and cameras and can take on a wide variety of beautiful 
and intriguing shapes. Some examples are shown in Figure 1. 

S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative 
Agreement DAAD19-01-2-0012. The U. S. Government is authorized to reproduce and 
distribute reprints for Government purposes notwithstanding any copyright notation thereon. 







Fig. 1. Some examples of decic curves of possible epipoles given four points in two calibrated 
images. 



Finally, we apply the theory to describe a solution to the 3 view 4 point perspective 
pose problem (3v4p problem for short), which amounts to finding the relative poses 
of three calibrated perspective views given 4 corresponding point triplets. The 3v4p 
problem is notoriously difficult to solve, but has a unique solution in general [5,10]. It 
is in fact overconstrained by one, meaning that four random point triplets in general 
cannot be realised as the three calibrated images of four common world points. Adjustment 
methods typically fail to solve the 3v4p problem and no practical numerical solution is 
known. Our theory leads to an efficient solution to the 3v4p problem that is based on 
a one-dimensional exhaustive search. The search procedure evaluates the points on the 
decic curve arising from two of the three views. Each point can be evaluated and checked 
for three view consistency with closed form calculations. The solution minimises an 
image based error concentrated to one point. This is reminiscent of [12], which works 
with three or more views in the uncalibrated setting, see also [11]. Our algorithm can 
also be used to determine if four image point triplets are realisable as the three calibrated 
images of four common world points. 

Many more point correspondences than the minimal number are needed to obtain 
robust and accurate solutions for structure and motion. The intended use for our 3v4p 
solution is as a hypothesis generator in a hypothesise-and-test architecture such as for 
example [8,9,2]. Many samples of four corresponding point triplets are taken and the 
solutions are scored based on their support among the whole set of observations. 

The rest of the paper is organised as follows. In Section 2, we establish some notation 
and highlight some known results. In Section 3, we describe the geometric construction 
that serves as the basis for the main discoveries of the paper. In Sections 4 and 5, we 
work out the consequences of the geometric construction. In Section 6, we give the 
algebraic expression for the decic curve. In Section 7 we establish further properties of 
the curve of possible epipoles and in Section 8 we reach our main result. In Section 9 
we investigate implications of orientation constraints. In Section 10 the 3v4p algorithm 
is given. Section 11 concludes. 



2 Preliminaries 

We broadly follow the notational conventions in [4,13]. Image points are represented by 
homogeneous 3-vectors x. Plane conics are represented by $3 \times 3$ symmetric matrices 
and we often refer to such a matrix as a conic. The symbol $\sim$ denotes equality up to 
scale. We use the notation $A^{*}$ to denote the adjugate matrix of A, namely, the transpose 
of the cofactor matrix of A. We will use $|A|$ to denote the determinant and $\mathrm{tr}(A)$ to 
denote the trace of the matrix A. We assume the reader has some background in multi-view 
geometry and is familiar with concepts such as camera matrices, the absolute 
conic, the image of the absolute conic (IAC) under a camera projection and its dual 
(the DIAC). When discussing more than one view, we generally use prime notation to 
indicate quantities that are related to the second image; for example x and x' might be 
corresponding image points in the first and second view, respectively. Similarly, we use 
e and e' to denote the two epipoles and $\omega$ and $\omega'$ to denote the IACs in two views. 

Given corresponding image points $x_i \leftrightarrow x'_i$ in two views the epipolar constraint [7, 
1] is: 

Theorem 1. The projective parameters of the rays between e and $x_i$ are homographically 
related to the projective parameters of the rays between e' and $x'_i$. 

The situation is illustrated in Figure 2. The condition asserts the existence of a 1D 
homography that relates the pencil of lines through e to the pencil of lines through e'. 
This homography is called the epipolar line homography. 
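
A 1D homography between two pencils of four lines exists exactly when their cross-ratios agree, so Theorem 1 can be checked numerically on synthetic data. The following is a minimal sketch, assuming a first camera [I | 0], a second camera [R | t], and illustrative values for R, t and the world points; the helper names (`bracket`, `pencil_cross_ratio`, `rotation`) are hypothetical and not taken from the paper.

```python
import numpy as np

def bracket(a, b, c):
    """Determinant |a b c| of three homogeneous 3-vectors."""
    return np.linalg.det(np.column_stack([a, b, c]))

def pencil_cross_ratio(e, pts):
    """Cross-ratio of the four rays joining the vertex e to pts[0..3]."""
    x1, x2, x3, x4 = pts
    num = bracket(e, x1, x3) * bracket(e, x2, x4)
    den = bracket(e, x1, x4) * bracket(e, x2, x3)
    return num / den

def rotation(axis, angle):
    """Rodrigues formula for a rotation about the given axis."""
    a = np.asarray(axis, float) / np.linalg.norm(axis)
    k = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)

# Synthetic calibrated two-view setup (all numbers are illustrative).
R = rotation([0.2, 1.0, 0.1], 0.3)
t = np.array([1.0, 0.2, 0.1])
X = np.array([[0.5, 0.3, 5.0], [-0.7, 0.4, 6.0],
              [0.2, -0.6, 4.0], [-0.3, -0.2, 7.0]])

x = [Xi for Xi in X]             # view 1 rays, camera [I | 0]
xp = [R @ Xi + t for Xi in X]    # view 2 rays, camera [R | t]
e = -R.T @ t                     # image of the second centre in view 1
ep = t                           # image of the first centre in view 2

cr1 = pencil_cross_ratio(e, x)
cr2 = pencil_cross_ratio(ep, xp)
print(cr1, cr2)                  # the two cross-ratios should agree
```

The bracket form of the cross-ratio is unaffected by rescaling any of the homogeneous vectors, which is why the unnormalised rays can be used directly.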

The epipolar constraint relates corresponding image points. For the pair $\omega$, $\omega'$ of 
corresponding conics we have the Kruppa constraints: 

Theorem 2. The two tangents from e to the IAC $\omega$ are related by the epipolar line 
homography to the two tangents from e' to the IAC $\omega'$. 

The constraint is illustrated in Figure 2. 





Fig. 2. Illustration of Theorems 1 and 2. The diagram shows two images, each with four image 
points, a conic and an epipole. The pencils of rays from the epipoles to the image points are related 
by the epipolar line homography. Similarly, the epipolar tangents to the images of the absolute 
conics are also related by the epipolar line homography. 



These algebraic constraints treat the pre-image of an image point as an infinite line 
extending both backwards and forwards from the projection centre. However, the image 
rays are in reality half-lines extending only in the forward direction. Moreover, unless 
our images have been mirrored, we typically know which direction is forward. The 
constraint that any observed world point should be on the forward part is referred to 
as the orientation constraint. The orientation constraints imply that the epipolar line 
homography is oriented and thus preserves the orientation of the rays in Theorem 1; see 
[14] for more details. 

3 The Geometric Construction 

Assume that we have two perspective views of four common but unknown world points 
and that the intrinsic parameters of the cameras are known but their poses are not. Let 
the image correspondences be $x_i \leftrightarrow x'_i$. In general, no three out of the four image points 
in either image are collinear and we shall henceforth exclude the collinear case from 
further consideration. 

Accordingly, we may then choose [13] projective coordinates in each image such 
that the four image points have the same coordinates in both images. In other words, we 
may assume that $x_i = x'_i$ and we shall do this henceforth and think of both image planes 
as co-registered into one coordinate system. 

It then follows from Steiner’s and Chasles’s Theorems [13] that the constraint from 
Theorem 1 can be converted into: 

Theorem 3. The epipoles e, e' and the four image points must lie on a conic B. Con- 
versely, two epipoles e, e' that are conconic with the four image points can satisfy the 
epipolar constraint. 
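
For a concrete picture of the conic B in Theorem 3, the sketch below builds the member of the pencil through four points that also passes through an epipole hypothesis e. It uses two line-pairs as a basis of the pencil, which is the construction the paper later writes as Equation (5). The point coordinates and the function names are illustrative assumptions.

```python
import numpy as np

def line_pair_conic(p, q, r, s):
    """Degenerate conic formed by the lines pq and rs (symmetric 3x3 matrix)."""
    l = np.cross(p, q)
    m = np.cross(r, s)
    return np.outer(l, m) + np.outer(m, l)

def conic_through(points, e):
    """Conic of the pencil through the four points that also contains e."""
    x1, x2, x3, x4 = points
    B1 = line_pair_conic(x1, x2, x3, x4)
    B2 = line_pair_conic(x1, x3, x2, x4)
    return (e @ B2 @ e) * B1 - (e @ B1 @ e) * B2

# Four co-registered image points and an epipole hypothesis (illustrative).
points = [np.array(p, float) for p in
          [(0, 0, 1), (1, 0, 1), (0, 1, 1), (1.2, 0.9, 1)]]
e = np.array([0.4, 1.7, 1.0])

B = conic_through(points, e)
residuals = [p @ B @ p for p in points + [e]]
print(residuals)   # all five values should vanish up to rounding
```

By Theorem 3, any second epipole e' compatible with e under the epipolar constraint must then satisfy e'^T B e' = 0 as well.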

An illustration is given in Figure 3. This conic will be important in what follows and the 
reader should take note of it now. When e and e' are conconic with the four image points, 
there is a unique epipolar line homography that makes the four lines through e correspond 
to the four lines through e'. One way to appreciate B is to note that we can parameterize 
the pencil of lines through e (or e') by the points of B and that corresponding lines of 
the two pencils meet the conic B in the same point. Armed with this observation we can 
translate the Kruppa constraints into: 

Theorem 4. The Kruppa constraints are equivalent to the condition that the two tangents 
to $\omega$ from e intersect B in the same two additional points as the two tangents to $\omega'$ from 
e'. 

This geometric construction will serve as a foundation for the rest of our development. 
The situation is depicted in Figure 3. Loosely speaking, the two projections (from the 
epipoles) of the IACs onto B must coincide. 




Fig. 3. Left: Illustration of Theorem 3. When the two image planes are co-registered so that 
corresponding image points coincide, the epipoles are conconic with the four image points. Right: 
The geometric construction corresponding to Theorem 4. The images of the IACs $\omega$, $\omega'$ made by 
projecting through the epipoles and onto B have to coincide. This construction is the basis for the 
rest of our development. 

4 Projection onto the Conic B 

To make progress from Theorem 4 we will work out how to perform the projection of 
an IAC $\omega$ onto a conic B = B(e) that is determined by an epipole and the four image 
points. One can think of the projection as being defined by the two points where the 
tangents to the IAC from an epipole meet B. But the two tangents do not come in any 
particular order, which is a nuisance. To avoid this complication we use the line joining 
the two intersection points on B as our representation. This is accomplished by: 

Theorem 5. The projection of the IAC $\omega$ onto the proper conic B through the epipole e 
is given by the intersections of the line $(\omega \circ B)e$ with B, where we define the conic 

$$(\omega \circ B) = 2B\omega^{*}B - \mathrm{tr}(\omega^{*}B)\,B. \qquad (1)$$

Proof:¹ We may choose [13] the coordinate system such that B is parameterized 
by $[\theta^2 \;\; \theta \;\; 1]^\top$ where $\theta$ is scalar. Let $\theta$ correspond to e and let $\lambda$ parameterise an 
additional point on B. The line through the two points defined by $\theta$ and $\lambda$ is then 
$l(\theta, \lambda) = [1 \;\; -(\theta+\lambda) \;\; \theta\lambda]^\top$. This line is tangent to $\omega$ when $l^\top \omega^{*} l = 0$, which by 
expanding both expressions can be seen equivalent to $[\lambda^2 \;\; \lambda \;\; 1]\,(\omega \circ B)\,[\theta^2 \;\; \theta \;\; 1]^\top = 0$. 
Hence, the projection of $\omega$ onto B through the point $[\theta^2 \;\; \theta \;\; 1]^\top$ is defined by the 
intersections of the line $(\omega \circ B)[\theta^2 \;\; \theta \;\; 1]^\top$ with B. The symmetric matrix $(\omega \circ B)$ 
thus represents a conic locus that has the properties stated in the theorem. Using the 
properties of trace, it can be verified that Equation (1) is a projectively invariant formula 
for a conic. The theorem follows. □ 
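
The statement of Theorem 5 can be checked numerically in the coordinates used in the proof. The sketch below is an illustration under stated assumptions: B is taken as the conic xz = y^2 (so its points are [s^2, s, 1]), the matrix `omega` is an arbitrary real symmetric matrix chosen so that the tangents are real (not an actual IAC), and the function names are hypothetical. It computes the line (omega o B)e, intersects it with B, and verifies that the chords from e to the two intersection points are tangent to omega.

```python
import numpy as np

def adjugate(A):
    """Adjugate (transposed cofactor matrix) of a 3x3 matrix."""
    r1, r2, r3 = A
    return np.array([np.cross(r2, r3), np.cross(r3, r1), np.cross(r1, r2)]).T

def project_conic(omega, B):
    """The conic (omega o B) = 2 B omega* B - tr(omega* B) B of Equation (1)."""
    D = adjugate(omega)
    return 2.0 * B @ D @ B - np.trace(D @ B) * B

# B in the coordinates of the proof: points [s^2, s, 1] satisfy x z = y^2.
B = np.array([[0.0, 0.0, 1.0], [0.0, -2.0, 0.0], [1.0, 0.0, 0.0]])
# Illustrative real conic standing in for the IAC.
omega = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, -0.2], [0.1, -0.2, -1.5]])

theta = 2.0
e = np.array([theta**2, theta, 1.0])       # the epipole, a point of B
polar = project_conic(omega, B) @ e        # the line (omega o B) e

# A point [lam^2, lam, 1] of B lies on the line iff lam is a root of this quadratic.
lams = np.roots(polar)

D = adjugate(omega)
residuals = []
for lam in lams:
    chord = np.array([1.0, -(theta + lam), theta * lam])  # line through e and B(lam)
    residuals.append(chord @ D @ chord)                   # zero iff tangent to omega
print(lams, residuals)
```

When e lies inside the chosen conic the roots come out as a complex conjugate pair; the tangency residuals still vanish, since the identity is algebraic.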

The situation is illustrated in Figure 4. We combine Theorems 4 and 5 to arrive 
at 

Theorem 6. Given that e and e' are conconic with the four image points, the Kruppa 
constraints are equivalent to the constraint that the polar lines $(\omega \circ B)e$ and $(\omega' \circ B)e'$ 
coincide. 

Fig. 4. The conic locus $(\omega \circ B)$ from Theorem 5. Note that the line $(\omega \circ B)e$ is the polar line of e 
with respect to $(\omega \circ B)$. This means that we are using the pole-polar relationship defined by the 
conic locus $(\omega \circ B)$ to perform the projection. Equation (1) shows that $(\omega \circ B)$ belongs to the 
pencil of conics determined by B and $B\omega^{*}B$. This is a manifestation of the fact that $(\omega \circ B)$ goes 
through the four points where the double tangents between $\omega$ and B touch B. These four points 
lie on both B and $B\omega^{*}B$. Moreover, the double tangents between $\omega$ and $(\omega \circ B)$ touch $(\omega \circ B)$ 
at the same four points. On the right, the line $(\omega \circ B)e$ and its pole $(\omega \bullet B)e$ (see the following 
section for the definition of $(\omega \bullet B)$) with respect to B is shown. Both can be used to represent the 
projection of $\omega$ onto B through e. 

¹ This theorem is a stronger version of Theorems 2 and 3, pages 179-180 in [13], which do 
not give a formula for $(\omega \circ B)$. It is possible, but not necessary for our purposes, to describe 
$(\omega \circ B)$ in classical terminology by saying that $\omega$ is the harmonic envelope of B and $(\omega \circ B)$. 
A given $\omega$ defines a correspondence $B \leftrightarrow (\omega \circ B)$ between plane conics in the sense that 
$\omega \circ (\omega \circ B) \sim B$. Note also that the $\circ$ operator is not commutative: projecting $\omega$ onto B is 
different from projecting B onto $\omega$. 

5 The Four Solutions 

Note that $(\omega \circ B) = B(\omega \bullet B)$ where we define the homography 

$$(\omega \bullet B) = 2\omega^{*}B - \mathrm{tr}(\omega^{*}B)\,I. \qquad (2)$$

In view of this, we can “cancel” a B in Theorem 6 and arrive at: 

Theorem 7. The epipoles are related by the seventh degree mapping 

$$e' \sim (\omega' \bullet B)^{*}\,(\omega \bullet B)\,e. \qquad (3)$$

The Kruppa constraints single out four solutions² for the epipole e on each proper conic 
B. The solutions are the intersections of B with the conic 

$$C = (\omega \bullet B)^\top (\omega' \bullet B)^{*\top} B\,(\omega' \bullet B)^{*} (\omega \bullet B). \qquad (4)$$

On the three conics B of the pencil for which $|(\omega' \bullet B)| = 0$, the four solutions group 
into two pairs of coincident solutions. 

² Apart from the four image points. 

Proof: Any two conics $B_1$, $B_2$ from the pencil can be used as a basis for the pencil and 
we have 

$$B(e) = (e^\top B_2 e)\,B_1 - (e^\top B_1 e)\,B_2, \qquad (5)$$

i.e. B(e) can be expressed quadratically in terms of e. According to Theorem 6, 
$(\omega' \circ B)e' \sim (\omega \circ B)e$, which for proper B is equivalent to $(\omega' \bullet B)e' \sim (\omega \bullet B)e$. 
If we assume that $|(\omega' \bullet B)| \neq 0$, this is equivalent to Equation (3), which is seen to 
be a 7-th degree mapping. Since e' has to be on B, i.e. $e'^\top B e' = 0$, we get that e 
must be on the conic C. The rest of the theorem follows from detailed consideration of 
the case when $(\omega' \bullet B)$ becomes rank 2, in which case C degenerates to a repeated line. □ 
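
To make the four solutions of Theorem 7 concrete, the sketch below intersects a fixed proper conic B with the conic C of Equation (4) and maps each solution e to e' through Equation (3). It is an illustration only: B is again the conic xz = y^2, the two symmetric matrices `omega` and `omega_p` are arbitrary stand-ins for the co-registered IACs, and the names `bullet`, `M` and `solutions` are hypothetical. Solutions may be complex, so complex arithmetic is kept throughout.

```python
import numpy as np

def adjugate(A):
    """Adjugate (transposed cofactor matrix) of a 3x3 matrix."""
    r1, r2, r3 = A
    return np.array([np.cross(r2, r3), np.cross(r3, r1), np.cross(r1, r2)]).T

def bullet(omega, B):
    """The homography 2 omega* B - tr(omega* B) I of Equation (2)."""
    D = adjugate(omega)
    return 2.0 * D @ B - np.trace(D @ B) * np.eye(3)

B = np.array([[0.0, 0.0, 1.0], [0.0, -2.0, 0.0], [1.0, 0.0, 0.0]])
omega = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, -0.2], [0.1, -0.2, -1.5]])
omega_p = np.array([[1.0, -0.4, 0.2], [-0.4, 3.0, 0.5], [0.2, 0.5, -2.0]])

U = bullet(omega, B)
Up = bullet(omega_p, B)
M = adjugate(Up) @ U          # e' ~ M e, Equation (3)
C = M.T @ B @ M               # Equation (4)

# With e = [s^2, s, 1] on B, the condition e^T C e = 0 is a quartic in s.
quartic = [C[0, 0], 2 * C[0, 1], 2 * C[0, 2] + C[1, 1], 2 * C[1, 2], C[2, 2]]

solutions = []
for s in np.roots(quartic):
    e = np.array([s**2, s, 1.0])
    e_prime = M @ e
    solutions.append((e, e_prime))

for e, e_prime in solutions:
    # e' lies on B, and U' e' is a multiple of U e (the polar lines of Theorem 6 coincide).
    print(abs(e_prime @ B @ e_prime), np.linalg.norm(Up @ e_prime - np.linalg.det(Up) * (U @ e)))
```

Parameterising B turns the intersection of two conics into a single quartic, which mirrors the closed-form flavour of the computation described in the text.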

If we use Equation (5) to express B in terms of e, then C is a 14-th degree 
function of e and we get 

Theorem 8. An epipole hypothesis e that gives rise to a proper conic B satisfies the 
16-th degree equation 

$$e^\top C e = 0 \qquad (6)$$

if and only if it can satisfy the epipolar and Kruppa constraints. 

An example plot of the set of points that satisfy Equation (6) is shown in Figure 5. As 
indicated by the following theorem, it also includes the six lines through all pairs of 
image points as factors: 

Theorem 9. For degenerate B (consisting of a line-pair) the homography $(\omega \bullet B)$ interchanges 
the lines of the line-pair. As a result, the curve defined by Equation (6) contains 
the six lines through pairs of the four image points, i.e. it contains the factor $|B|$. 

Proof: As e moves to approach one of the lines of a line pair, the two points defined by 
projecting $\omega$ onto B through e approach the other line of the line pair. Hence, the line 
$(\omega \circ B)e$ becomes the other line. In a similar fashion, the point $(\omega \bullet B)e$, which is the 
pole of $(\omega \circ B)e$ with respect to B, becomes a point on the other line. Thus, we get 
$C \sim B$ when B is a line-pair. Since e lies on B per definition of B, the theorem follows. □ 

Hence, Equation (6) defines a superset of the possible epipoles as determined by 
the epipolar and Kruppa constraints. However, not all the points on the six lines are 
allowed by the Kruppa constraints. In fact, since Theorem 4 applies for any B, one can 
work out the consequences of the geometric construction specifically for degenerate 
B in a similar fashion as for proper B. This leads to the following theorem, which we 
state without proof: 

Theorem 10. Apart from the four image points, there are at most two possible epipoles 
e on any line joining a pair of the four image points. 



6 The Decic Expression 

Fig. 5. The 16-th degree expression in Equation (6) defines a superset of the possible epipoles as 
determined by the epipolar and Kruppa constraints. However, it also includes the six lines through 
all pairs of image points as factors, which can also be eliminated. The plot on the left shows the 
16-th degree curve, including the six lines, and the plot on the right shows the decic curve resulting 
when removing the six lines. The four small black dots are image points. 

The algebraic endeavour of eliminating the factor $|B|$ from Equation (6) to arrive at a 
decic expression is surprisingly involved. We will just state the result. Define 

$$D = \omega^{*}, \qquad t = \mathrm{tr}(DB), \qquad U = (\omega \bullet B) = 2DB - tI \qquad (7)$$

and analogously for the primed entities. Then the decic expression is 

$$e^\top G(e)\,e = 0, \qquad (8)$$

where G(e) is the conic defined by the symmetric part of 

$$4U^\top D'^{*} B^{*} D'^{*} U + 2t'\,U^\top D'^{*} U' U + 4t'^{2}\,B D D'^{*}(DB - tI) - 4t'^{2}\,\mathrm{tr}(B^{*} D'^{*})\,D^{*} + t'^{4} D^{*} + t^{2} t'^{2} D'^{*}. \qquad (9)$$

G(e) can readily be seen to be octic in terms of e, since D is constant and B, U and t 
are all quadratic in e. Some examples of the decic are shown in Figure 1. 
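
The degree count can be spelled out term by term. The following worked bookkeeping is a sketch that assumes the six terms of Equation (9) read as $U^\top D'^{*} B^{*} D'^{*} U$, $t' U^\top D'^{*} U' U$, $t'^{2} B D D'^{*}(DB - tI)$, $t'^{2}\,\mathrm{tr}(B^{*}D'^{*})\,D^{*}$, $t'^{4} D^{*}$ and $t^{2} t'^{2} D'^{*}$, that the primed quantities $t'$ and $U'$ are defined from the same B as in Equation (7), and that B is quadratic in e through Equation (5), so that its adjugate $B^{*}$ is quartic in e.

```latex
\begin{align*}
\deg_e B = \deg_e U = \deg_e U' = \deg_e t = \deg_e t' &= 2, \qquad \deg_e B^{*} = 4,\\
\deg_e\bigl(U^{\top} D'^{*} B^{*} D'^{*} U\bigr) &= 2 + 4 + 2 = 8,\\
\deg_e\bigl(t'\,U^{\top} D'^{*} U' U\bigr) &= 2 + 2 + 2 + 2 = 8,\\
\deg_e\bigl(t'^{2}\,B D D'^{*}(DB - tI)\bigr) &= 4 + 2 + 2 = 8,\\
\deg_e\bigl(t'^{2}\,\mathrm{tr}(B^{*} D'^{*})\,D^{*}\bigr) &= 4 + 4 = 8,\\
\deg_e\bigl(t'^{4} D^{*}\bigr) = \deg_e\bigl(t^{2} t'^{2} D'^{*}\bigr) &= 8,\\
\deg_e\bigl(e^{\top} G(e)\,e\bigr) &= 1 + 8 + 1 = 10.
\end{align*}
```

Every term is homogeneous of degree eight, which is what allows the sum to define a single conic G(e) and the full expression to be a decic.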

7 Further Properties of the Curve 

The decic expression (8) is in fact exactly the set of possible epipoles under the epipolar 
and Kruppa constraints. The following leads to a property of the set of possible epipoles 
that serves as a cornerstone in getting this result. 

Theorem 11. Given three point correspondences $x_i \leftrightarrow x'_i$ and the epipole e in one 
image, the epipolar and Kruppa constraints lead to four solutions for the other epipole e'. 

Proof: To prove this we give a novel constructive algorithm in Appendix A. 

Theorem 12. The set of possible epipoles according to the epipolar and Kruppa constraints 
has exactly four branches through each of the four image points.³ The branches 
are continuous.⁴ For points in general position, 0, 2 or 4 of the branches can be real. 

³ We allow the world points to coincide with the projection centres of the cameras. 

⁴ Assume the joint epipole (e, e') describes a smooth curve in $\mathbb{P}^2 \times \mathbb{P}^2$. When we talk about 
tracing a curve branch in an image we really have in mind tracing the curve in $\mathbb{P}^2 \times \mathbb{P}^2$. The 
curve projects into $\mathbb{P}^2$ in such a way that four points map to each image point $x_i$ and the four 
non-intersecting branches in $\mathbb{P}^2 \times \mathbb{P}^2$ project to the four intersecting branches of the image 
curve. 

Proof: According to Theorem 11 we have that for a general epipole position e, there 
are four solutions for the essential matrix that obey three given point pairs. When e 
coincides with one of the image points, $x_1$ say, all four solutions that obey the other 
three image point pairs also satisfy $x_1 \leftrightarrow x'_1$, since the epipolar line l' joining e' and $x'_1$ 
always maps into a line l through $x_1$. Moreover, the line l determined by the other three 
image point pairs changes continuously if we change e and hence l has to be the tangent 
direction of the corresponding curve branch. It is in fact the same as the tangent at $x_1$ 
to the conic determined by the four image points and the corresponding solution for 
e'. The tangent direction of each branch can thus be computed in closed form with the 
algorithm in Appendix A. It is clear from the algorithm that points in general position 
cannot give rise to an odd number of real solutions. Using the algorithm, we have found 
examples of cases with 0, 2 and 4 real solutions. □ 

The situation is illustrated in Figure 6. Two real branches is by far the most common 
case. 

Fig. 6. Left: As indicated by Theorem 12, there are four branches of the curve of possible epipoles 
through each of the four image points. Each curve branch is tangent to the conic B that includes the 
four image points and the epipole e' corresponding to e coincident with the image point according 
to Theorem 12 and Appendix A. Middle: An example of a decic curve with 0, 2 and 4 real branches 
through the image points. The image points are marked with small circles. Right: Close-up. 

8 Main Result 

Theorem 13. An epipole hypothesis e satisfies the epipolar and Kruppa constraints if 
and only if it lies on the decic curve defined by Equation (8). 

Proof: We give a sketch of the detailed proof, which takes several pages. It follows from 
Theorems 8 and 9 that the possible epipoles off the six lines are described by the decic. 
The key is then to use Theorems 7 and 12 to establish that ten degrees are necessary and 
that the decic does not have any redundant factors, which follows from Bezout’s Theorem 
when considering the number of possible e on general B. Finally, by Theorem 10 
and continuity of the geometric construction, one can establish that the decic intersects 
any one of the six lines at the correct two additional points apart from the image points. □ 
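
One way to read the degree count in the proof sketch is as the following Bezout bookkeeping. It is an illustration rather than the detailed proof, and it assumes that a general conic B of the pencil meets each of the four branches at an image point transversally, so that every image point contributes intersection multiplicity four (one per branch of Theorem 12), while Theorem 7 contributes four further solutions on B. For a curve of degree $d$ intersected with a conic:

```latex
\begin{align*}
2\,d &= \underbrace{4 \times 4}_{\text{four image points, four branches each}}
      \; + \underbrace{4}_{\text{additional solutions on } B}
      = 20,\\
d &= 10.
\end{align*}
```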

We can also get a more complete version of Theorem 7 that applies even when 
the conic B from Equation (5) is degenerate. 

Theorem 14. Given any point y distinct from the four image points, the possible epipoles 
on the conic B(y) according to the epipolar and Kruppa constraints are exactly the four 
image points plus the intersections between the two conics B(y) and G(y). 

Proof: According to Theorem 13, a point e is a possible epipole iff it lies on G(e). 
An epipole hypothesis e apart from the four image points generates the same conic 
B(e) = B(y) as y iff it lies on B(y). Since G(e) can be written as a function of B 
only, all e on B(y) apart from the four image points generate the same G(e) = G(y). 
Thus, the points on B(y) apart from the four image points satisfy the decic iff they lie 
on G(y). Finally, the four image points lie on B(y) by construction and they are always 
possible epipoles. □ 

Theorem 15. The curve of possible epipoles according to the epipolar and Kruppa 
constraints has exactly ten singular points. The four image points each have multiplicity 
four. In addition, there are exactly three pairs of nodal points with multiplicity two.⁵ 
The three pairs of nodal points occur on the three conics B for which $|(\omega' \bullet B)| = 0$. 
These conics are exactly the three B of the pencil with an inscribed quadrangle that is 
also circumscribed to $\omega'$. 

Proof: Recall Theorem 6 and observe that on proper conics B, the solution e has multiplicity 
two exactly when the line $l = (\omega \circ B)e$ obtained by projecting $\omega$ onto B through 
e also can be obtained by projecting $\omega'$ onto B in two distinct ways through two distinct 
points e' on B. This is illustrated in Figure 7. According to Poncelet’s Porism [13], given 
proper conics B and $\omega'$, we have two possibilities. Either there is no quadrangle inscribed 
in B that is also circumscribed to $\omega'$, or there is one such quadrangle with any point on 
B as one of its vertices. In the former case, no epipole hypothesis e' on B in the second 
image ever generates the same line $l' = (\omega' \circ B)e'$ as some other epipole hypothesis. 
Hence no solution for e can then have multiplicity two. In the latter case, every epipole 
hypothesis e' generates the same line l' as exactly one other epipole hypothesis. Thus, in 
this case the solutions e on B always have multiplicity two. The latter case has to happen 
exactly when $|(\omega' \bullet B)| = 0$ and we see that this must be the same as the condition that 
there is a quadrangle inscribed in B that is also circumscribed to $\omega'$. The remaining parts 
of the theorem follow from Theorems 7, 12 and 13. □ 
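
The singularity count in Theorem 15 determines the genus of the decic through the degree-genus formula. The short sketch below evaluates that formula, assuming all singular points are ordinary multiple points so that a point of multiplicity m lowers the genus by m(m-1)/2; the function name `genus` is illustrative.

```python
def genus(degree, multiplicities):
    """Genus of a plane curve whose singularities are ordinary multiple points."""
    total = (degree - 1) * (degree - 2)
    for m in multiplicities:
        total -= m * (m - 1)
    return total // 2

# Degree ten, four points of multiplicity four, six nodes (three pairs).
decic_genus = genus(10, [4] * 4 + [2] * 6)
print(decic_genus)  # 6
```

A rational curve would have genus zero, so a value of six is consistent with the remark that the curve is not rational.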

⁵ By the degree-genus formula [6] the genus of the curve is therefore 
$((10-1)(10-2) - 4 \times 4(4-1) - 6 \times 2(2-1))/2 = 6$ so, in particular, the curve is not rational. 

9 Bringing in the Orientation Constraint 

The orientation constraint asserts that the space point corresponding to a visible image 
point must lie in front of the camera. The situation is illustrated in Figure 7. Given e 
on the decic, verification of the orientation constraints is straightforward. Equation (3) 
determines e'. The epipolar line homography, and hence the essential matrix, is then 
determined by the point correspondences. It is well known [8] that the essential matrix 
corresponds to four possible 3D configurations and that the orientation information from 
a corresponding point pair singles out one of them. The orientation constraints can be 
satisfied exactly when all four point correspondences indicate the same configuration. 
We will split the orientation constraint into two conditions: 

Firstly, the two forward half-rays of an image correspondence lie in the same half- 
plane (the baseline separates each epipolar plane into two half-planes). Then the common 
space point is either on the forward part of both rays, or on the backward part of both rays. 
This condition is called the oriented epipolar constraint because it is satisfied exactly 
when the epipolar line homography is oriented [14]. 

Secondly, the forward half-rays should converge in their common half-plane. We 
will refer to this as the convergence constraint. 
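
The role of the four 3D configurations can be illustrated with the standard SVD-based decomposition of an essential matrix. This is a minimal sketch under stated assumptions, not necessarily the procedure of [8]: the convention is x' ~ R x + t with E = [t]_x R, the data are synthetic and noise-free, and all function names are hypothetical. Each correspondence votes for the candidate that places the space point in front of both cameras; the orientation constraints are satisfiable when all votes agree.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def rotation(axis, angle):
    """Rodrigues formula for a rotation about the given axis."""
    k = skew(np.asarray(axis, float) / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)

def decompose_essential(E):
    """Four (R, t) candidates, with t of unit length."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    u3 = U[:, 2]
    return [(U @ W @ Vt, u3), (U @ W @ Vt, -u3),
            (U @ W.T @ Vt, u3), (U @ W.T @ Vt, -u3)]

def depths(R, t, x, xp):
    """Depths (lam, lam') solving lam' * xp = lam * R x + t in least squares."""
    A = np.column_stack([R @ x, -xp])
    sol, *_ = np.linalg.lstsq(A, -t, rcond=None)
    return sol

def configuration_votes(E, xs, xps):
    """Per correspondence, the candidate indices with both depths positive."""
    cands = decompose_essential(E)
    votes = [[k for k, (R, t) in enumerate(cands)
              if np.all(depths(R, t, x, xp) > 0)]
             for x, xp in zip(xs, xps)]
    return cands, votes

# Synthetic noise-free data (illustrative values).
R_true = rotation([0.1, 1.0, 0.2], 0.25)
t_true = np.array([1.0, 0.1, -0.2])
X = np.array([[0.5, 0.3, 5.0], [-0.7, 0.4, 6.0],
              [0.2, -0.6, 4.0], [-0.3, -0.2, 7.0]])
xs = [Xi / Xi[2] for Xi in X]
Xp = [R_true @ Xi + t_true for Xi in X]
xps = [Xi / Xi[2] for Xi in Xp]

E = skew(t_true) @ R_true
cands, votes = configuration_votes(E, xs, xps)
print(votes)   # every correspondence should select the same single candidate
```

With inconsistent data the vote lists would disagree or be empty, which is the situation the orientation constraints are meant to detect.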




Fig. 7. Left: For there to be multiple e' corresponding to one e, there has to be a quadrangle 
inscribed in B that is also circumscribed to $\omega'$. According to Poncelet’s porism, there is either 
no such quadrangle, or a whole family of them. Right: The orientation constraint is that space 
points should be in the forward direction on their respective image rays. It can be partitioned into 
requiring 1) that the forward half-rays point into the same half-plane and 2) that the half-rays 
converge in that half-plane. 

Theorem 16. The satisfiability of the oriented epipolar constraint can only change at 
those points e of the decic curve for which e or one of its possible corresponding e' 
coincides with one of the four image points, i.e. only at the four image points or at the 
up to $4 \times 4$ real points that correspond to one of the four image points according to 
Theorem 12 and the algorithm in Appendix A. 

Proof: When we move e along the decic curve and neither e nor its corresponding 
e' coincides with one of the four image points, e' and the epipolar line homography 
change continuously with e. Moreover, since the epipolar constraint is satisfied for 
all points on the decic, the ray orientations cannot change unless one of the epipoles 
coincides with one of the four image points. □ 

The situation is illustrated in Figure 8. 

Theorem 17. A branch of the decic through an image point can have at most one side 
allowed by the oriented epipolar constraint. One side is allowed iff the epipolar line 
homography accompanying the pair of epipoles e, e' corresponding to the branch is 
oriented with respect to the other three image correspondences. 

Proof: For a particular branch, through x say, e' and the epipolar line homography 
change continuously with e, so the orientations of the rays to the other three image 
points do not change at x. However, the orientation of the ray to x changes when e 
passes through x and the theorem follows. □ 

For points on the decic, there is a valid epipolar geometry. The rays can only 
change between convergent and divergent when they become parallel. This can only 
occur when the angles between the image points and the epipoles are equal in both 
images. The Euclidean scalar product between image directions x and y is encoded up 
to scale in the IAC $\omega$ as $\langle x | y \rangle = x^\top \omega y$. Using this and $e' \sim U'^{*} U e$ it can be shown 
that parallelism can only occur when⁶ 

$$(x^\top \omega' x)(e^\top U^\top U'^{*\top} \omega' U'^{*} U e)(x^\top \omega e)^2 - (x^\top \omega x)(e^\top \omega e)(x^\top \omega' U'^{*} U e)^2 = 0, \qquad (10)$$

which is a 16th degree expression in e. Its intersections with the decic are the only places 
where the rays from x can change between converging and diverging. 
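
The way the IAC encodes the Euclidean scalar product can be checked directly. The sketch below assumes pixel coordinates and a hypothetical calibration matrix K, for which the IAC is K^{-T} K^{-1}; the paper instead works in co-registered coordinates, where the IACs are the correspondingly transformed conics. The angle computed from the IAC is compared against the angle between the back-projected ray directions.

```python
import numpy as np

def iac(K):
    """Image of the absolute conic, omega = K^{-T} K^{-1}."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ K_inv

def cos_angle(omega, x, y):
    """Cosine of the angle between the rays through image points x and y."""
    return (x @ omega @ y) / np.sqrt((x @ omega @ x) * (y @ omega @ y))

# Hypothetical calibration matrix and two image points in pixels.
K = np.array([[800.0, 0.0, 320.0], [0.0, 780.0, 240.0], [0.0, 0.0, 1.0]])
x = np.array([100.0, 50.0, 1.0])
y = np.array([500.0, 400.0, 1.0])

omega = iac(K)
c_iac = cos_angle(omega, x, y)

# Reference: angle between the back-projected directions K^{-1} x and K^{-1} y.
dx = np.linalg.solve(K, x)
dy = np.linalg.solve(K, y)
c_ref = (dx @ dy) / (np.linalg.norm(dx) * np.linalg.norm(dy))
print(c_iac, c_ref)   # the two cosines should agree
```

Equal angles between image point and epipole in the two views, measured this way with the respective IACs, is the parallelism condition that Equation (10) expresses in polynomial form.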

10 The 3v4p Algorithm 

Given four point correspondences in three calibrated views, we choose two views and 
trace out the decic curve for those two views with a one-dimensional sweep driven by 
a parameter 6. For each value of 0 , all computations can be carried out very efficiently 
in closed form. The parameter is used to indicate one conic B from the pencil of co- 
nics. Given B, we calculate the conic G from Equation (9) as a function of B and the 
intersections between G and B can then be found in closed form as the roots of a quartic 
polynomial. This yields up to four solutions for e. For each solution, the corresponding 
e' can be found through Equation (3). If we rotate both coordinate systems so that the 

6 Remember that we are assuming that the points are co-registered. 
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Fig. 8. Examples of curves of possible epipoles after the oriented epipolar constraint and the 
Kruppa constraints have been enforced. The small circles mark the four image points, while the 
squares mark the up to 4 x 4 real points that correspond to one of the four image points according 
to Theorem 12 and the algorithm in Appendix A. As indicated by Theorem 16. the curve has no 
loose ends apart from those points. The curves are rendered as the orthographic projection of a 
half-sphere and all curve segments that appear loose actually reappear at the antipode. There are 
also configurations of four point pairs for which the set of possible epipoles is completely empty, 
i.e. all configurations of four points in two calibrated views are not possible according to the 
oriented epipolar constraint. 



epipoles are moved to the origin, finding the epipolar line homography is just a simple 
matter of solving for a 1-D rotation with possible reflection. Thus, we get the essential 
matrix for the two views corresponding to each solution. Following [8], we can then 
select a camera configuration for the two views and get the locations of the four points 
through triangulation. Each solution then leads to up to four solutions for the pose of 
the third view when solving the three point perspective pose problem [3] for three of the 
points. The orientation constraints are used to disqualify solutions for which the space 
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Fig. 9. Possible epipoles when all constraints (epipolar, Kruppa and full 3D orientation) are en- 
forced. 



points are not on the forward part of the image rays. Finally, the fourth point can be 
projected into the third view. For the correct value of 6 and the correct solution, the pro- 
jection of the fourth point should coincide with its observed image position. Moreover, 
this will only occur for valid solutions and in general there is a unique solution. Thus, 6 
is swept through the pencil of conics and the solution resulting in the reprojection closest 
to the observed fourth point position is selected. 
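The outer loop of the sweep described above can be sketched as follows. This is only an illustrative skeleton, not the authors' implementation: `solve_for_theta` is a hypothetical callback standing in for all the closed-form steps (conic intersection, essential matrix, triangulation, three-point pose, orientation tests), and every name here is an assumption.

```python
import math

def sweep_select(thetas, solve_for_theta, observed):
    """Return (error, theta, solution) for the candidate whose reprojected
    fourth point lies closest to the observed one, or None if there is none."""
    best = None
    for theta in thetas:
        # Hypothetical callback: yields (solution, projected_point) pairs for
        # the conic of the pencil selected by this value of theta.
        for solution, projected in solve_for_theta(theta):
            err = math.dist(projected, observed)
            if best is None or err < best[0]:
                best = (err, theta, solution)
    return best
```

The sampling density of `thetas` controls how closely the sweep can approach the exact solution; the source does not state how the parameter is sampled.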

This algorithm has been implemented and shown to be very effective in practice. 
Experimental results will appear in an upcoming journal paper. 



11 Conclusion 

We have given necessary and sufficient conditions for the epipolar and Kruppa
constraints to be satisfied given four corresponding points in two calibrated images. The
possible epipoles are exactly those on a decic curve. We have shown that the second
epipole is related to the first by a seventh degree expression. We have shown that if the
orientation constraints are taken into account, only a subset of the decic curve
corresponds to possible epipoles. As a result, we have found that there are configurations
of four pairs of corresponding points that cannot occur in two calibrated images. This
is similar in spirit to [14]. We have shown that points on the decic curve can be
generated in closed form and that it is possible to trace out the curve efficiently with a
one-dimensional sweep. This yields a solution to the notoriously difficult problem of
solving for the relative orientation of three calibrated views given four corresponding
points. In passing, we have given a novel algorithm for finding the essential matrix given
three point correspondences and one of the epipoles.

A Three Points Plus Epipole

To support our arguments we need to show that given three point correspondences
$x_i \leftrightarrow x'_i$ and the epipole $e$ in one image, the epipolar and Kruppa constraints lead to four
solutions for the other epipole $e'$. To do this, we give a novel algorithm that constructs
the four solutions. If the epipole in the first image is known, we can rotate the image
coordinate system so that it is at the origin. Then the essential matrix is of the form
$E = [E_1\ E_2\ 0;\ E_3\ E_4\ 0;\ E_5\ E_6\ 0]$. A point correspondence $x' \leftrightarrow x$ contributes the
constraint $X^\top \tilde{E} = 0$, where

$$X = [x'_1 x_1 \;\; x'_1 x_2 \;\; x'_2 x_1 \;\; x'_2 x_2 \;\; x'_3 x_1 \;\; x'_3 x_2]^\top \qquad (11)$$

and

$$\tilde{E} = [E_1\ E_2\ E_3\ E_4\ E_5\ E_6]^\top. \qquad (12)$$

If the vectors $X^\top$ from three point correspondences are stacked, we get a 3 × 6 matrix.
$\tilde{E}$ must be in its 3-dimensional nullspace. Let $Y, Z, W$ be a basis for the nullspace. Then
$\tilde{E}$ is of the form $\tilde{E} = yY + zZ + wW$, where $y, z, w$ are some scalars. Since an essential
matrix $E$ is characterised by having two equal singular values and one zero singular
value, we have exactly the two additional constraints $E_1 E_2 + E_3 E_4 + E_5 E_6 = 0$ and
$E_1^2 + E_3^2 + E_5^2 = E_2^2 + E_4^2 + E_6^2$. These constraints represent two conics and four
solutions for $[y\ z\ w]^\top$.
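The two quadratic conditions above can be checked numerically on a synthetic essential matrix. The sketch below is an illustration under stated assumptions, not part of the paper: it builds $E = [t]_\times R$ with $t = R e$ and $e = (0, 0, 1)^\top$, so that the epipole of the first image sits at the origin, and then evaluates both residuals. The rotation angles and all function names are arbitrary choices.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def cross_matrix(t):
    """Skew-symmetric matrix [t]_x with [t]_x v = t x v."""
    return [[0.0, -t[2], t[1]],
            [t[2], 0.0, -t[0]],
            [-t[1], t[0], 0.0]]

def constraint_residuals(E):
    """Residuals of E1*E2 + E3*E4 + E5*E6 = 0 and
    E1^2 + E3^2 + E5^2 = E2^2 + E4^2 + E6^2 for E = [E1 E2 0; E3 E4 0; E5 E6 0]."""
    (e1, e2, _), (e3, e4, _), (e5, e6, _) = E
    return (e1 * e2 + e3 * e4 + e5 * e6,
            (e1 ** 2 + e3 ** 2 + e5 ** 2) - (e2 ** 2 + e4 ** 2 + e6 ** 2))

# Synthetic example (assumed values): a rotation R and translation t = R e,
# which makes e = (0, 0, 1) the epipole in the first image (E e = 0).
a, b = 0.4, 0.7
Rz = [[math.cos(a), -math.sin(a), 0.0],
      [math.sin(a), math.cos(a), 0.0],
      [0.0, 0.0, 1.0]]
Rx = [[1.0, 0.0, 0.0],
      [0.0, math.cos(b), -math.sin(b)],
      [0.0, math.sin(b), math.cos(b)]]
R = matmul(Rz, Rx)
t = [R[i][2] for i in range(3)]   # t = R e with e = (0, 0, 1)
E = matmul(cross_matrix(t), R)    # essential matrix E = [t]_x R
residuals = constraint_residuals(E)
```

Both residuals vanish up to rounding, and the third column of `E` is zero, in line with the form of $E$ assumed in this appendix.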



The views and conclusions contained in this document are those of the authors and should not 
be interpreted as representing the official policies, either expressed or implied, of the Army 
Research Laboratory or the U.S. Government.

Abstract. A dynamic visual search framework based mainly on inner-scene
similarity is proposed. Algorithms as well as measures quantifying
the difficulty of search tasks are suggested. Given a number of candidates
(e.g. sub-images), our basic hypothesis is that more visually similar
candidates are more likely to have the same identity. Both deterministic and
stochastic approaches, relying on this hypothesis, are used to quantify
this intuition. Under the deterministic approach, we suggest a measure
similar to Kolmogorov's ε-covering that quantifies the difficulty of a
search task and bounds the performance of all search algorithms. We also
suggest a simple algorithm that meets this bound. Under the stochastic
approach, we model the identities of the candidates as correlated random
variables and characterize the task using its second order statistics.
We derive a search procedure based on minimum MSE linear estimation.
Simple extensions enable the algorithm to use top-down and/or
bottom-up information, when available.



1 Introduction 

Visual search is required in situations where a person or a machine views a scene
with the goal of finding one or more familiar entities. The highly effective
visual-search (or more generally, attention) mechanisms in the human visual system
were extensively studied from psychophysics and physiology points of view.
Yarbus [24] found that the eyes rest much longer on some elements of an image,
while other elements may receive little or no attention. Neisser [11] suggested
that the visual processing is divided into pre-attentive and attentive stages. The
first consists of parallel processes that simultaneously operate on large portions
of the visual field, and form the units to which attention may then be directed.
The second stage consists of limited-capacity processes that focus on a smaller
portion of the visual field. Treisman and Gelade (feature integration theory
[19]) formulated a hypothesis about how the human visual system performs
pre-attentive processing. They characterized (qualitatively) the difference between
search tasks requiring scan (serial) and those which do not (parallel, or pop-out).
While several aspects of the Feature Integration Theory were criticized,
the theory was dominant in visual search research and much work was carried
out based on its premises, e.g. to understand how feature integration occurs
(some examples are [8,23,21]). Duncan and Humphreys rejected the dichotomy
of parallel vs. serial search and proposed an alternative theory based on
similarity [3]. According to their theory, two types of similarities are involved in a
visual search task: between the objects in the scene, and between the objects
and prior knowledge. They suggest that when a scene contains several similar
structural units there is no need to treat every unit individually. Thus, if all
non-targets are homogeneous, they may be rejected together resulting in a fast
(pop-out like) detection, while if they are heterogeneous the search is slower.

Several search mechanisms were implemented, usually in the context of HVS
(human visual system) studies (e.g. [8,21,23,5]). Other implementations focused
on computer vision applications (e.g. [7,17,18]), and sometimes used other
sources of knowledge to direct visual search. For example, one approach is to search
first for a different object, easier to detect, which is likely to appear close to the
sought-for target ([15,22]). Relatively little was done to quantitatively
characterize the inherent difficulty of search tasks. Tsotsos [20] considers the complexity
of visual search and proves, for example, that spatial boundedness of the target
is essential to make the search tractable. In [22], the efficiency of indirect search
is analyzed.

This work has two goals: to provide efficient search algorithms and to
quantitatively characterize the inherent difficulty of search tasks. We focus on the role
of inner-scene similarity. As suggested in [3], the HVS mechanism uses similarity
between objects of the same identity to accelerate the search. In this paper we
show that computerized visual search can also benefit from such information,
while most visual search applications totally ignore this source of knowledge. We
take both deterministic and stochastic approaches. Under the deterministic
approach, we characterize the difficulty of the search task using a metric-space cover
(similar to Kolmogorov's ε-covering [9]) and derive bounds on the performance
of all search algorithms. We also propose a simple algorithm that provably meets
these bounds. Under the stochastic approach, we model the identity of the
candidates as a set of correlated random variables taking target/non-target values
and characterize the task using its second order statistics. We propose a linear
estimation based search algorithm which can handle both inner-scene similarity
and top-down information, when available.

Paper outline: The context for visual search and some basic intuitive assumptions
are described in Sect. 2. Sect. 3 develops bounds on the performance of search
algorithms, providing measures for search tasks' difficulty. Sect. 4 describes the
VSLE algorithm based on stochastic considerations. In Sect. 5 we experimentally
demonstrate the validity of the bounds and the algorithms' effectiveness. (footnote 1)

1 A preliminary version of the VSLE algorithm was presented in [1].

2 Framework

2.1 The Context: Candidate Selection and Classification

The task of looking for object(s) of a certain identity in a visual scene is often
divided into two subtasks. One is to select sub-images which serve as candidates. The
other, the object recognition task, is to decide whether a candidate is a sought-for
object or not. The candidate selection task can be performed by a segmentation
process or even by a simple division of the image into small rectangles. The
candidates may be of different size, bounded or unbounded [20], and can also
overlap. The object recognizer is usually computationally expensive, as the
object appearance may vary due to changes in shape, color, pose, illumination etc.
The recognizer may need to recognize a category of objects (and not a specific
model), which usually makes it even more complex.

The object recognition process gets the candidates, one by one, after some 
ordering. An efficient ordering, which is more likely to put the real objects first, 
is the key to high efficiency of the full task. This ordering is the attentional 
mechanism on which we focus here. 

2.2 Sources of Information for Directing the Search 

Several information sources enabling more efficient search are possible:
Bottom-up saliency of candidates - In modelling HVS attention, it is often
claimed that a saliency measure, quantifying how every candidate is different
from the other candidates in the scene, is calculated ([19,8,7]). Saliency is
important in directing attention, but it can sometimes mislead or not be applicable
when, say, the scene contains several similar targets.

Top-down approach - When prior knowledge is available, the candidates may
be ranked by their degree of consistency with the target description ([23,6]). In
many cases, however, it is hard to characterize the objects of interest in a way
which is effective and inexpensive to evaluate.

Mutual similarity of candidates - Usually, a higher inner-scene visual
similarity implies a higher likelihood for similar (or equal) identity ([3]). Under this
assumption, revealing the identity of one (or a few) candidates can affect
the likelihood of the remaining candidates to have the same or a different identity.

In this paper we focus on (the less studied) mutual similarity between
candidates, and assume that no other information is given. Nevertheless, we show
how to handle top-down information and saliency, when available.

To quantify similarity, we embed the candidates as points in a metric space
with distances reflecting dissimilarities. We shall either assume that the distance
between two objects of different identities is larger than a threshold
(deterministic approach), or that the identity correlation is a monotonically descending
function of this distance (stochastic approach).

2.3 Algorithms Framework 

The algorithms we propose share a common framework. They begin from an
initial priority map, indicating the prior likelihood of each candidate to be a
target. Iteratively, the candidate with the highest priority receives the attention.
The relevant sub-image is examined by a high-level recognizer, which we denote
the recognition oracle. Based on the oracle's response and the previous priority
map, a new priority map is calculated, taking into account the similarities.

Usually, systems based on bottom-up or top-down approaches suggest
calculating a saliency map before the search starts, pre-specifying the scan order. This
static map may change only to inhibit the return to already attended locations
[8]. The search algorithms proposed here, however, are dynamic as they change
the priority map based on the results of the object recognizer.

2.4 Measures of Performance 

For quantifying the search performance, we take a simplified approach and as- 
sume that only the costs associated with calling the recognition oracle are sub- 
stantial. Therefore, we measure (and predict) the number of queries required to 
find a target. 

3 Deterministic Bounds of Visual Search Performance 

In this section we analyze formally the difficulty of search tasks. Readers inte- 
rested only in the more efficient algorithms based on a stochastic approach can 
skip this section and continue reading from section 4.1. 



Notations. We consider an abstract description of a search task as a pair $(X, l)$,
where $X = \{x_1, x_2, \ldots, x_n\}$ is a set of partial descriptions associated with the
set of candidates, and $l : X \to \{T, D\}$ is a function assigning identity labels
to the candidates. $l(x_i) = T$ if the candidate $x_i$ is a target, and $l(x_i) = D$ if
$x_i$ is a non-target (or a distractor). An attention, or search algorithm, $A$, is
provided with the set $X$, but not with the labels $l$. It requires $cost_1(A, X, l)$
calls to the recognition oracle until the first target is found. We refer to the set
of partial descriptions $X = \{x_1, x_2, \ldots, x_n\}$ as points in a metric space $(S, d)$,
$d : S \times S \to R^{+}$ being the metric distance function. The partial description can
be, for example, a feature vector, and the distance may be the Euclidean metric.



A Difficulty Measure Combining Targets’ Isolation and Candidates’ 
Scattering. We would like to develop a search task characteristic which quanti- 
fies the search task difficulty. To be effective, this characteristic should combine 
two main factors: 

1. The feature-space-distance between target and non-target candidates. 

2. The distribution of the candidates in the feature space. 

Intuitively, the search is easier when the targets are more distant from non- 
targets. However, if the non-targets are also different from each other, the search 
again becomes difficult. A useful quantification for expressing a distribution of 
points in a metric space uses the notion of a metric cover [9] . 

Definition 1. Let $X \subset S$ be a set of points in a metric space $(S, d)$. Let $2^S$ be
the set of all possible subsets of $S$. $\mathcal{C} \subset 2^S$ is 'a cover' of $X$ if
$\forall x \in X\ \exists C \in \mathcal{C}$ s.t. $x \in C$.

Definition 2. $\mathcal{C} \subset 2^S$ is a '$d_0$-cover' of a set $X$ if $\mathcal{C}$ is a cover of $X$ and if
$\forall C \in \mathcal{C}$, $diameter(C) < d_0$, where $diameter(C)$ is $\max_{c_1, c_2 \in C} d(c_1, c_2)$.

Definition 3. A 'minimum-$d_0$-cover' is a $d_0$-cover with a minimal number of
elements. We shall denote a minimum-$d_0$-cover and its size by $\mathcal{C}_{d_0}(X)$ and
$c_{d_0}(X)$, respectively.

If, for example, $X$ is a set of feature vectors in a Euclidean space, $c_{d_0}(X)$ is
the minimum number of m-spheres with diameter $d_0$ required to cover all points
in $X$.
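Definitions 1 and 2 translate directly into a small checker. The sketch below is illustrative only: it assumes a Euclidean metric, represents each covering element by the finite list of points it contains, and uses function names that do not appear in the paper.

```python
import math
from itertools import combinations

def diameter(element):
    """Largest pairwise distance inside one covering element (0 for a singleton)."""
    return max((math.dist(a, b) for a, b in combinations(element, 2)), default=0.0)

def is_d0_cover(X, cover, d0):
    """True if every point of X lies in some element of the cover and
    every element has diameter strictly smaller than d0."""
    covers_all = all(any(x in element for element in cover) for x in X)
    small_enough = all(diameter(element) < d0 for element in cover)
    return covers_all and small_enough
```

For three points `(0, 0)`, `(0.1, 0)` and `(1, 0)` with $d_0 = 0.5$, grouping the first two together and leaving the third alone is a valid $d_0$-cover, while a single element holding all three is not.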

Definition 4. Given a search task $(X, l)$, let the 'max-min-target-distance',
denoted $d_T$, be the largest distance of a target to its nearest non-target neighbor.

Theorem 1. Let $\mathcal{X}_{d_0,c}$ denote the family of all search tasks $(X, l)$ for which
$d_T$, the max-min-target-distance, is bounded from below by some $d_0$ ($d_T \geq d_0$) and
for which the minimum-$d_0$-cover size is $c$ ($c_{d_0}(X) = c$). The value $c$ quantitatively
describes the difficulty of $\mathcal{X}_{d_0,c}$ in the sense that:

1. Any search algorithm $A$ needs to query the oracle for at least $c$ candidates in
the worst case before finding a target. ($\forall A\ \exists (X, l) \in \mathcal{X}_{d_0,c} :\ cost_1(A, X, l) \geq c$)

2. There is an algorithm that, for all tasks in this family, needs no more than
$c$ queries for finding the first target. ($\exists A\ \forall (X, l) \in \mathcal{X}_{d_0,c} :\ cost_1(A, X, l) \leq c$)

Proof: 1. We first provide such a 'worst case' $X$, and then choose the labels
$l$ depending on the algorithm $A$. Choose $c$ points in the metric space, so that
all the inter-point distances are at least $d_0$. Choose the $n$ candidates to be
divided equally among these locations. Until a search algorithm finds the first
target, it receives only 'no' answers from the recognition oracle. Therefore, given a
specific algorithm $A$ and the set $X$, the sequence of attended candidates may be
simulated under the assumption that the oracle returns only 'no' answers. Choose
an assignment of labels $l$ that assigns $T$ only to the group of candidates located
at the point whose first appearance in that sequence is last. $A$ will query the
oracle at least $c$ times before finding a target.

2. We suggest the following simple algorithm, which suffices for the proof:
FLNN - Farthest Labeled Nearest Neighbor: Given a set of candidates $X =
\{x_1, \ldots, x_n\}$, randomly choose the first candidate, query the oracle and label
this candidate. Repeat iteratively, until a target is detected: for each unlabeled
candidate $x_i$, compute the distance $d^L_i$ to the nearest labeled neighbor. Choose
the candidate $x_i$ for which $d^L_i$ is maximum. Query the oracle to get its label.
Let us show that FLNN finds the first target after at most $c$ queries for all
search tasks $(X, l)$ from the family $\mathcal{X}_{d_0,c}$. Take an arbitrary minimum-$d_0$-cover
of $X$, $\mathcal{C}_{d_0}(X)$. Let $x_i$ be a target so that $d(x_i, x_j) \geq d_0$ for every distractor $x_j$
(such an $x_i$ exists since $d_T \geq d_0$). Let $C$ be a covering element ($C \in \mathcal{C}_{d_0}(X)$) so
that $x_i \in C$. Note that all candidates in $C$ are targets. Excluding $C$, there are
$(c - 1)$ other covering elements in $\mathcal{C}_{d_0}(X)$ with diameter $< d_0$. Since $C$ contains
a candidate whose distance from all distractors is $\geq d_0$, FLNN will not query two
distractor-candidates in one covering element (whose distance is $< d_0$) before it
queries at least one candidate in $C$. Therefore, a target will be located after at
most $c$ queries. (It is possible that a target that is not in $C$ will be found earlier,
and then the algorithm stops even earlier.) ■
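The FLNN procedure used in the proof is short enough to write out. The following is a minimal sketch under assumptions that are not in the paper: a Euclidean metric, an `oracle` callable that takes a candidate index and returns `True` for a target, and an optional `start` index that replaces the random first choice so that runs are reproducible.

```python
import math
import random

def flnn(X, oracle, start=None):
    """Farthest Labeled Nearest Neighbor search.
    Returns (index of the first target found or None, number of oracle queries)."""
    n = len(X)
    current = random.randrange(n) if start is None else start
    unlabeled = set(range(n)) - {current}
    queries = 1
    if oracle(current):
        return current, queries
    # d_near[i]: distance from candidate i to its nearest labeled candidate.
    d_near = [math.dist(X[i], X[current]) for i in range(n)]
    while unlabeled:
        # Attend to the unlabeled candidate farthest from everything labeled so far.
        current = max(unlabeled, key=lambda i: d_near[i])
        unlabeled.remove(current)
        queries += 1
        if oracle(current):
            return current, queries
        for i in unlabeled:
            d_near[i] = min(d_near[i], math.dist(X[i], X[current]))
    return None, queries
```

On a toy task with two tight distractor clusters and one tight target cluster, all far apart, the minimum cover has three elements and the sketch finds a target within three queries from any starting candidate, as Theorem 1 predicts.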

Note that no specific metric is considered in the above claim and proof. 
However, the cover size and the implied search difficulty depend on the partial 
description (features), which may be chosen depending on the application. 

Note also that FLNN does not need to know do and performs optimally (in 
the worst case) relative to the (unknown) difficulty of the task. 

Note that $c_{d_T}(X)$ is the tightest suggested upper bound on the performance
of FLNN for a task $(X, l)$ whose max-min-target-distance is $d_T$. Given a
search task, naturally, we do not know in advance which candidates are the targets
and so do not know $d_T$. Nevertheless, we might know that the task belongs to a family of
search tasks for which $d_T$ is greater than some $d_0$. In this case we can compute
$c_{d_0}(X)$, and predict an upper bound on the number of queries required by FLNN.

The problem of finding the minimum cover is NP-hard. Gonzalez [4] proposes
a 2-approximation algorithm for the problem of clustering a data set minimizing
the maximum inner-cluster distance, and proves it is the best approximation
possible if $P \neq NP$. In our experiments we used a heuristic algorithm that
provided tighter upper bounds on the minimum cover size. Note also that, according
to the theorem, FLNN's worst-case results may serve as a lower bound on the
minimum cover size as well.

Since computing the cover is hard, we also suggest a simpler measure for
search difficulty. Given a bounded metric space containing the candidates, cover
all the space with covering elements with diameter $d_0$. (For the m-dimensional
bounded Euclidean metric space $[0, 1]^m$, there are such elements.) The
number of non-empty such covering elements is an upper bound on the minimal
cover size. See [2] for more results and a more detailed discussion.
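The simpler measure can be sketched by counting occupied grid cells. The shape of the covering elements is an assumption here, since the source does not spell it out: the sketch uses axis-aligned half-open hypercubes of side $d_0 / \sqrt{m}$, so that any two points falling in the same cell are less than $d_0$ apart. The function name is hypothetical.

```python
import math

def grid_cover_upper_bound(X, d0):
    """Number of non-empty grid cells, an upper bound on the minimum d0-cover size.
    Assumes Euclidean feature vectors and hypercube cells of side d0 / sqrt(m)."""
    m = len(X[0])
    side = d0 / math.sqrt(m)
    # Each candidate is mapped to the integer coordinates of the cell containing it.
    cells = {tuple(math.floor(c / side) for c in x) for x in X}
    return len(cells)
```

The bound is cheap to compute (one pass over the candidates) but can be loose, because a tight cluster that straddles a cell boundary is counted more than once.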

4 Dynamic Search Algorithm Based on a Stochastic 
Model 

The FLNN algorithm suffers from several drawbacks. It relates only to the
nearest neighbor, which makes it non-robust. A single attended distractor close to
an undetected target reduces the priority of this target and slows the search.
Moreover, it does not extend naturally to finding more than one target, or to
incorporating bottom-up and top-down information, when available. The
alternative algorithm suggested below addresses these problems.

4.1 Statistical Dependencies Modelling

Taking a stochastic approach, we model the object identities as binary random 
variables with possible values 0 (for non-target) or 1 (for target). 

Recall that objects associated with similar identities tend to be more visually
similar than objects which are of different identities. To quantify this intuition,
we set the covariance between two labels to be a monotonic descending function
$\gamma$ of the feature-space-distance between them: $cov(l(x_i), l(x_j)) = \gamma(d(x_i, x_j))$,
where $X = \{x_1, x_2, \ldots, x_n\}$ is a set of partial descriptions (feature vectors)
associated with the set of candidates, $l(x_i)$ is the identity label of the candidate
$x_i$, and $d$ is a metric distance function. In our experiments we use an
exponentially descending function ($e^{-d(x_i, x_j)/d_{max}}$, where $d_{max}$ is the greatest distance
between feature vectors), which seems to be a good approximation to the actual
dependency (see Sect. 5.2).



4.2 Dynamic Search Framework 



We propose a greedy approach to a dynamic search. At each iteration, estimate 
the probability of each unlabelled candidate to be a target using all the know- 
ledge available. Choose the candidate for which the estimated probability is the 
highest and apply the object recognition oracle on the corresponding sub-image. 

After the m-th iteration, m candidates, $x_1, x_2, \ldots, x_m$, were already handled
and m labels, $l(x_1), l(x_2), \ldots, l(x_m)$, are known. We use these labels to estimate
the conditional probability of the label $l(x_k)$ of each unlabelled candidate $x_k$ to
be 1:

$$p_k = p(l(x_k) = 1 \mid l(x_1), \ldots, l(x_m)). \qquad (1)$$



4.3 Minimum Mean Square Error Linear Estimation 

Now, note that the random variable $l_k$ is binary and, therefore, its expected
value is equal to its probability to take the value 1. Estimating the expected value,
conditioned on the known data, is generally a complex problem and requires
knowledge about the labels' joint distribution. We use a linear estimator minimizing
the mean square error criterion, which needs only second order statistics.

Given the measured random variables $l(x_1), l(x_2), \ldots, l(x_m)$, we seek a linear
estimate $\hat{l}_k$ of the unknown random variable $l(x_k)$, $\hat{l}_k = a_0 + \sum_{i=1}^{m} a_i l(x_i)$, which
minimizes the mean square error $e = E((l(x_k) - \hat{l}_k)^2)$. Solving a set
of (Yule-Walker) equations [13] provides the following estimation:

$$\hat{l}_k = E[l(x_k)] + a^\top (l - E[l]), \qquad (2)$$

where $l = (l(x_1), l(x_2), \ldots, l(x_m))$ and $a = R^{-1} \cdot r$. $R_{ij}$, $i, j = 1, \ldots, m$, and $r_i$,
$i = 1, \ldots, m$, are given by $R_{ij} = cov(l(x_i), l(x_j))$ and $r_i = cov(l(x_k), l(x_i))$.

$E(l_k)$ is the expected value of the label $l_k$, which is the prior probability for
$x_k$ to be a target. If there is no such knowledge, $E(l_k)$ can be set to be uniform,
i.e., $1/n$ (where $n$ is the number of candidates). If there is prior knowledge on
the number of targets in the scene, $E(l_k)$ should be set to $m/n$ (where $m$ is the
expected number of targets).
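As a small worked case of Equation (2), suppose only one label, $l(x_1)$, is known ($m = 1$). Then $R$ and $r$ are scalars and, assuming the exponential covariance of Sect. 4.1 (so that $\gamma(0) = 1$), the estimate reduces to the following:

```latex
\begin{aligned}
R &= \operatorname{cov}(l(x_1), l(x_1)) = \gamma(0), \qquad
r = \operatorname{cov}(l(x_k), l(x_1)) = \gamma(d(x_k, x_1)), \\
\hat{l}_k &= E[l(x_k)] + \frac{\gamma(d(x_k, x_1))}{\gamma(0)}\,\bigl(l(x_1) - E[l(x_1)]\bigr)
           = E[l(x_k)] + e^{-d(x_k, x_1)/d_{\max}}\,\bigl(l(x_1) - E[l(x_1)]\bigr).
\end{aligned}
```

A queried non-target ($l(x_1) = 0$) therefore lowers the estimate most strongly for the candidates nearest to it, while a queried target ($l(x_1) = 1$) raises it.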

The estimated label $\hat{l}_k$ is the conditional mean of the label $l(x_k)$ of an
unclassified candidate $x_k$, and, therefore, may be interpreted as the probability of $l(x_k)$
to be 1:

$$p_k = p(l(x_k) = 1 \mid l(x_1), \ldots, l(x_m)) \approx \hat{l}_k.$$

4.4 The Algorithm: Visual Search Using Linear Estimation (VSLE)

- Given a scene image, choose n sub-images to be candidates.

- Extract the set of feature vectors $X = \{x_1, x_2, \ldots, x_n\}$.

- Calculate pairwise feature space distances and the implied covariances.

- Select the first candidate/s randomly (or based on some prior knowledge).

- In iteration m + 1:

  • For each candidate $x_k$ out of the n - m remaining candidates, estimate
  $\hat{l}_k \in [0, 1]$ based on the known labels $l(x_1), \ldots, l(x_m)$ using Equation (2).

  • Query the oracle on the candidate $x_k$ for which $\hat{l}_k$ is maximum.

  • If enough targets were found, stop.
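The steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the exponential covariance of Sect. 4.1 with a Euclidean metric, a single scalar `prior` for every candidate, a fixed `first` candidate instead of a random one, and it assumes all feature vectors are distinct so that the covariance matrix of the known labels is invertible. The names `vsle`, `solve`, `oracle`, `prior`, `first` and `targets_wanted` are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def vsle(X, oracle, prior, first=0, targets_wanted=1):
    """Return the order in which candidates were queried."""
    n = len(X)
    dist = [[math.dist(a, b) for b in X] for a in X]
    d_max = max(max(row) for row in dist) or 1.0
    cov = [[math.exp(-dist[i][j] / d_max) for j in range(n)] for i in range(n)]
    known, order, nxt = {}, [], first   # known: index -> label (1 target, 0 non-target)
    while True:
        known[nxt] = 1 if oracle(nxt) else 0
        order.append(nxt)
        if sum(known.values()) >= targets_wanted or len(known) == n:
            return order
        idx = list(known)
        R = [[cov[i][j] for j in idx] for i in idx]
        # w = R^{-1} (l - E[l]); since R is symmetric, the estimate of
        # Equation (2) is prior + r_k . w for each unlabeled candidate k.
        w = solve(R, [known[i] - prior for i in idx])
        best_est = -math.inf
        for k in range(n):
            if k in known:
                continue
            est = prior + sum(cov[k][i] * wi for i, wi in zip(idx, w))
            if est > best_est:
                best_est, nxt = est, k
```

Solving one linear system per iteration and reusing the result for all unlabeled candidates avoids inverting $R$ separately for each $k$. The nearest-neighbor approximation mentioned in the text is not implemented here.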

Our goal is to minimize the expected search time, and the proposed algo- 
rithm, being greedy, cannot achieve an optimal solution. It is, however, optimal 
with respect to all other greedy methods (based on second order statistics), as 
it uses all the information collected in the search to make the decision. 

Note that clustered non-targets accelerate the search and even let the target 
pop-out when there is only a single non-target cluster. Clustered targets are 
found immediately after the first target is detected. 

As the covariance decreases with distance, estimating the labels only from 
their nearest (classified) neighbors is a valid approximation which accelerates 
the search. 

4.5 Combining Prior Information 

Bottom-up and top-down information may be naturally integrated by specifying 
the prior probabilities (or the prior means) according to either the saliency or the 
similarity to known models. Moreover, if the top-down information is available as 
k model images (one or more), we can simply add them as additional candidates 
that were examined before the actual search. Continuing the search from this
point is naturally faster; see end of Sect. 5.2.

5 Experiments 

In order to test the ideas described so far, we conducted many experiments using
images of different types, using different methods for candidate selection, and
different features to partially describe the candidates. Below, we describe a few
examples that demonstrate the relation between the algorithms’ performance 
and the tasks’ difficulty. 



5.1 FLNN and Minimum-Cover-Size

The first set of experiments considers several search tasks and focuses on their
characterization using the proposed metric cover. Because calculating the
minimal cover size is computationally hard, we suggest several ways to bound it
from above and from below and show that combining these methods yields a
very good approximation. In this context we also test the FLNN algorithm and
demonstrate its guaranteed performance. Finally we provide the intuition
explaining why indeed harder search tasks are characterized by larger covers.

The first three search tasks are built around the 100 images corresponding to 
the 100 objects in the COIL-100 database [12] in a single pose. We think of these 
images as candidates extracted from some larger image. The extracted features 
are first, second, and third Gaussian derivatives in five scales [14] resulting in 
feature vectors of length 45. A Euclidean metric is used as the feature space
distance. The tasks differ in the choice of the target which was cups (10 targets), 
toy cars (10 targets) and toy animals (7 targets) in the three search tasks. 

The minimal cover size for every task is bounded as follows: First, the minimal
target-distractor distance, $d_T$, is calculated. We developed a greedy heuristic
algorithm which prefers sparse regions and provides a possibly non-tight but
always valid $d_T$-cover; see [2] for details. For the cups search task the cover size
was, for example, 24. For all tasks, this algorithm provided smaller (and tighter)
covers than those obtained with the 2-approximation algorithm suggested by
Gonzalez [4], which for the cups task gave a cover of size 42. Both algorithms
provide upper bounds on the size of the minimal cover. See Table 1 for cover sizes.
Being a rigorous 2-approximation, half of the latter upper bound value (42/2 = 21
for the cups) is also a rigorous lower bound on the minimal cover size. Another
lower bound may be found by running the FLNN algorithm itself, which, by
Theorem 1, needs no more than $c_{d_T}(X)$ queries to the oracle. By running the
algorithm 100 times, starting from a different candidate each run and taking the
largest number of queries required (18 for the cups task), we often get the tightest
lower bound; see Table 1, where the average number of queries required by
FLNN is given as well.
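The way the bounds combine can be made explicit with a few lines. The sketch below is only an illustration of the bookkeeping: the function name is hypothetical, and rounding half of the 2-approximation size up to the next integer is an assumption that follows from the cover size being a whole number. The input values are the ones reported in Table 1.

```python
import math

def cover_size_bounds(flnn_worst, heuristic, two_approx):
    """Combine the lower and upper bounds on the minimal cover size."""
    lower = max(flnn_worst, math.ceil(two_approx / 2))
    upper = min(heuristic, two_approx)
    return lower, upper

# (FLNN worst, heuristic cover size, 2-approx. cover size) from Table 1.
table_1 = {
    "cups": (18, 24, 42),
    "cars": (73, 79, 88),
    "toy animals": (22, 25, 42),
    "elephants": (9, 9, 11),
    "parasols": (6, 8, 13),
}
bounds = {task: cover_size_bounds(*row) for task, row in table_1.items()}
```

The resulting intervals reproduce the 'Real cover size' column of Table 1, for example 21 to 24 for the cups task and 7 to 8 for the parasols task.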

Note that the search for cars was the hardest. While the car targets are 
very similar to each other (which should ease the search), finding the first car 
is hard due to the presence of distractors which are very similar to the cars 
($d_T$ is small). The cups are also similar to each other, but are dissimilar to
the distractors, implying an easier search. On the other hand, the different toy 
animals are dissimilar, but as one of them is very dissimilar from all candidates, 
the task is easier as well. Note that the minimal cover size captures the variety 
of reasons characterizing search difficulty in a single scalar measure. 

We also experimented with images from the Berkeley hand segmented da- 
tabase [10] and used the segments as candidates; see Fig.l. Small segments are 
ignored, leaving us with 24 candidates in the elephants image and 30 candidates 
in the parasols image. The targets are the segments containing elephants and 
parasols, respectively. For those colored images we use color histograms as fea- 
ture vectors. In each segment (candidate), we extract the values of r+ b g+b and 
r+g+b d' om ea ch pixel, where r, g, and b are values from the RGB representation. 
Each of these two dimensions is divided into 8 bins, resulting a feature vector 
of length 64. Again, we use Euclidean metric for distance measure. (Using other 
histogram comparison methods, such as the ones suggested in [16] the results 
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Fig. 1. The elephants and parasols images taken from the Berkeley hand segmented 
database and the segmentations we used in our experiments, (colored images) 

Table 1. Experiment results for FLNN and cover size. The real value of minimal cover 
size is bounded from below by ‘FLNN worst’ and the half of ‘2-Approx. cover size’, and 
bounded from above by ‘Heuristic cover size’ and ‘2-Approx. cover size’. The rightmost 
column shows that VSLE improves the results of FLNN for finding the first target. 



Search task | # of cand. | # of targets | FLNN worst | FLNN mean | Heuristic cover size | 2-Approx. cover size | Real cover size | VSLE worst
cups        | 100        | 10           | 18         | 8.97      | 24                   | 42                   | 21-24           | 15
cars        | 100        | 10           | 73         | 33.02     | 79                   | 88                   | 73-79           | 39
toy animals | 100        | 7            | 22         | 9.06      | 25                   | 42                   | 22-25           | 13
elephants   | 24         | 4            | 9          | 5.67      | 9                    | 11                   | 9               | 8
parasols    | 30         | 6            | 6          | 3.17      | 8                    | 13                   | 7-8             | 4

were similar.) See the results in Table 1. Although the mean results are usually 
not better than the mean results of a random search, the worst results are much 
better. 
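
The color-histogram features used for the elephants and parasols images can be sketched as follows. This is only an illustration under stated assumptions: the two per-pixel values are taken to be the normalised chromaticities r/(r+g+b) and g/(r+g+b), the histogram is normalised by the number of pixels, and black pixels are skipped. The function names are hypothetical and do not come from the paper.

```python
import math

def chromaticity_histogram(pixels, bins=8):
    """Return a bins*bins histogram over (r/(r+g+b), g/(r+g+b)), normalised to sum 1."""
    hist = [0.0] * (bins * bins)
    count = 0
    for r, g, b in pixels:
        total = r + g + b
        if total == 0:
            continue  # assumption: chromaticity is undefined for black pixels, so skip them
        # Map each chromaticity value in [0, 1] to a bin index in [0, bins - 1].
        i = min(int(bins * r / total), bins - 1)
        j = min(int(bins * g / total), bins - 1)
        hist[i * bins + j] += 1.0
        count += 1
    if count:
        hist = [h / count for h in hist]
    return hist

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two tiny hypothetical segments: one reddish, one grey.
reddish = chromaticity_histogram([(200, 30, 20), (180, 40, 30)])
greyish = chromaticity_histogram([(90, 90, 90), (120, 118, 121)])
print(len(reddish), euclidean(reddish, greyish))
```

With 8 bins per dimension the vector has 8 x 8 = 64 entries, which matches the feature length stated in the text.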

5.2 VSLE and Covariance Characteristics 

The VSLE algorithm described in Sect. 4 was implemented and applied to the
same five visual search tasks described in Sect. 5.1. See Fig. 2 for part of the
results. Unlike FLNN, which deals only with finding the first target, VSLE
continues and also aims to find the other targets. Moreover, in almost all the
experiments we performed, VSLE was faster in finding the first target (both in
the worst and the mean results). See the rightmost column in Table 1.

VSLE relies on the covariance between candidates’ labels. We use a covariance
function that depends only on feature-space distance, and argue that for many
search tasks this function is monotonically decreasing in this distance. To
check this assumption, we estimated the covariance of labels vs. feature-space
distance for the search tasks and confirmed its validity; see Fig. 2 and [2].
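
One simple way to estimate label covariance as a function of feature-space distance is sketched below. The estimator is an assumption made for illustration (the paper does not specify the one it used): labels are 1 for targets and 0 for non-targets, the mean label is the fraction of targets, and candidate pairs are grouped into distance bins of a fixed, hypothetical width.

```python
import math
from collections import defaultdict

def label_covariance_by_distance(features, labels, bin_width):
    """Estimate cov(l_i, l_j) over candidate pairs, grouped by feature-space distance."""
    n = len(labels)
    p = sum(labels) / n  # fraction of targets, used as the mean label
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(features[i], features[j])
            b = int(d / bin_width)  # index of the distance bin
            sums[b] += (labels[i] - p) * (labels[j] - p)
            counts[b] += 1
    # Average the centred label products within each distance bin.
    return {b: sums[b] / counts[b] for b in sorted(counts)}

# Hypothetical toy data: two targets close together, two non-targets far away.
cov = label_covariance_by_distance(
    [(0.0,), (0.2,), (3.0,), (3.2,)], [1, 1, 0, 0], bin_width=1.0)
print(cov)
```

In this toy example the estimate is positive for nearby pairs and negative for distant ones, which is the decreasing behaviour the text argues for.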

We experimented with a preliminary version of integrated segmentation and
search. An input image (see Fig. 3) was segmented using k-means clustering in
the RGB color space (using 6 clusters). All (146) connected components larger
than 100 pixels served as candidates. The VSLE algorithm searched for the (7)
faces in the image, using a feature vector of length 4: each segment is
represented by the mean values of red, green, and blue and by the segment size.

No prior information on size, shape, color, or location was used. Note that
this search task is hard due to the presence of similarly colored objects in the
background, and due to the presence of hands, which share the same color but
are not classified as targets. Note that in most runs six of the seven faces are
detected after about one-sixth of the segments are examined. We deliberately
chose a very crude segmentation, demonstrating that very good segmentation is
not required for the proposed search mechanism.

Using the method suggested in Sect. 4.5, we incorporate top-down information
and demonstrate it on the toy cars case: 3 toy cars which do not belong to the 
COIL-100 database are used as model targets. The search time was significantly 
reduced as expected; see Fig. 4. 

6 Discussion 

In this paper we considered the usage of inner-scene similarity for visual se- 
arch, and provided both measures for the difficulty of the search, and algorithms 
for implementing it. We took a quantitative approach, allowing us not only to 
optimize the search but also to quantitatively predict its performance. 

Interestingly, while we did not aim at modelling the HVS attention system, 
it turns out that it shares many of its properties, and in particular, is similar to 
Duncan and Humphreys’s model [3]. As such, our work can be considered as a 
quantification of their observations. Not surprisingly, our results also show that 
there is a continuity between the two poles of ‘pop-out’ and ‘sequential’ searches. 

While many search tasks rely only on top-down or bottom-up knowledge,
inner-scene similarities always help and may become the dominant source of
knowledge when less is known about the target. Consider, for example, the
parasols search task (Sect. 5). First, note that the targets take up a
significant fraction of the image and therefore cannot be salient. Second, the
parasols are similar to each other and differ from the non-targets in their
color, but if this color is unknown, they cannot be searched for using top-down
information. More generally, considering a scene containing several objects of
the same category, we argue that their sub-images are more similar than images
of such objects taken at different times and places. This happens because the
imaging conditions are more uniform and because the variability of objects is
smaller in one place (e.g., two randomly chosen trees are more likely to be of
the same type if they are taken from the same area).

We are now working on building an overall automatic system that will combine
the suggested algorithms (extended to use bottom-up and top-down information)
with grouping and object recognition methods. We also intend to continue
analyzing search performance. We would like to be able to predict search time
for the VSLE algorithm, for instance, in a manner similar to what we have
achieved for FLNN. While the measure of minimal cover size as a lower bound for
the worst cases holds, we aim to suggest a tighter bound for cases that are
statistically more common.



Fig. 2. VSLE and covariance vs. distance results. (a) VSLE results for the cups
search task. The solid lines describe one typical run. Other runs, starting each
time from a different candidate, are described by the size of the gray spots as
a distribution in the (time, number of targets found) space. It is easy to find
the first cup since most cups are different from non-targets. Most cups resemble
each other and are found quickly afterwards, but there are three cups (two
without a handle and one with a special pattern) that are different from the
rest of the cups and are found rather late. (b) Estimate of labels covariance
vs. feature-space distance for the cups search task. (c) VSLE results for the
parasols search task. All the parasols are detected very fast, since their color
is similar and differs from that of all other candidates. (d) Estimate of labels
covariance vs. feature-space distance for the parasols search task.







Fig. 3. VSLE applied to an automatically color-segmented image to detect faces.
(a) The input image (color image). (b) Results of an automatic crude color-based
segmentation. (c) VSLE results (see the caption of Fig. 2 for what is shown in
this graph).





Fig. 4. VSLE using top-down information for the toy cars search task. (a) The
three model images. (b) VSLE results without using the models. (c) Results of
extended VSLE using the model images.

Abstract. In this paper we describe the first stage of a new learning
system for object detection and recognition. For our system we propose 
Boosting [5] as the underlying learning technique. This allows the use of 
very diverse sets of visual features in the learning process within a com- 
mon framework: Boosting — together with a weak hypotheses finder — 
may choose very inhomogeneous features as most relevant for combina- 
tion into a final hypothesis. As another advantage the weak hypotheses 
finder may search the weak hypotheses space without explicit calculation 
of all available hypotheses, reducing computation time. This contrasts with
the related work of Agarwal and Roth [1], where Winnow was used as the
learning algorithm and all weak hypotheses were calculated explicitly.
In our first empirical evaluation we use four types of local descriptors: 
two basic ones consisting of a set of grayvalues and intensity moments 
and two high level descriptors: moment invariants [8] and SIFTs [12]. 
The descriptors are calculated from local patches detected by an inter- 
est point operator. The weak hypotheses finder selects one of the local 
patches and one type of local descriptor and efficiently searches for the 
most discriminative similarity threshold. This differs from other work on 
Boosting for object recognition where simple rectangular hypotheses [22] 
or complex classifiers [20] have been used. In relatively simple images, 
where the objects are prominent, our approach yields results comparable 
to the state-of-the-art [3]. But we also obtain very good results on more 
complex images, where the objects are located in arbitrary positions, 
poses, and scales in the images. These results indicate that our flexible 
approach, which also allows the inclusion of features from segmented re- 
gions and even spatial relationships, leads us a significant step towards 
generic object recognition. 



1 Introduction 

We believe that a learning component is a necessary part of any generic object
recognition system. In this paper we investigate a principled approach for
learning objects in still images which allows the use of flexible and extendible
sets of features for describing objects and object categories. Objects should be
recognized even if they occur at arbitrary scale, shown from different
perspective views on highly textured backgrounds. Our main learning technique
relies on Boosting [5]. Boosting is a technique for combining several weak
classifiers into a final strong classifier. The weak classifiers are calculated
on different weightings of the training examples to emphasize different portions
of the training set. Since any classification function can potentially serve as
a weak classifier, we can use classifiers based on arbitrary and inhomogeneous
sets of image features. A further advantage of Boosting is that weak classifiers
may be calculated when needed instead of calculating unnecessary hypotheses a
priori.

In our learning setting, the learning algorithm needs to learn an object cat- 
egory. It is provided with a set of labeled training images, where a positive 
label indicates that a relevant object appears in the image. The objects are not 
segmented and pose and location are unknown. As output, the learning algo- 
rithm delivers a final classifier which predicts if a relevant object is present in 
a new image. Having such a classifier, the localization of the object in the im- 
age is straightforward. The image analysis transforms images to grey values and 
extracts normalised regions around interest (salient) points to obtain reduced 
representations of images. As an appropriate representation for the learning pro- 
cedure we calculate local descriptors of these patches. The result of the training 
procedure is saved as the final hypothesis which is later used for testing (see 
figure 1). 



[Figure: block diagram of the processing stages, with boxes labelled Images,
Region Detection, and Boosting.]

Fig. 1. Overview showing the framework of our approach for generic object
recognition. The solid arrows show the training cycle, the dotted ones the
testing procedure.



We describe our general learning approach in detail in section 2. In section
3, we discuss the image analysis steps, including illumination and size
normalisation, interest point detection, and the extraction of the local
descriptors. An explicit explanation of how we calculate the weak hypotheses
used by the Boosting algorithm is given in section 4. Section 5 contains a
description of the setup we used for our experiments. The results are presented
and compared with other approaches for object recognition. We regard the present
system as a first step and further work is outlined in section 6.



1.1 Related Work 

Clearly there is an extensive body of literature on object recognition (e.g. [3], 
[2], [22], [24], [6], [14]). In general, these approaches use image databases which 
show the object of interest at prominent scales and with only little variation in 
pose. We discuss only some of the most relevant and most recent results related 
to our approach. 

Boosting was successfully used by Viola and Jones [22] as the ingredient for a 
fast face detector. The weak hypotheses were the thresholded average brightness 
of collections of up to four rectangular regions. In our approach we experiment 
with much larger sets of features to be able to perform recognition of a wider 
class of objects. 

Schneiderman and Kanade [20] used Boosting to improve an already complex
classifier. In contrast, we are using Boosting to combine rather simple classifiers 
by selecting the most discriminative features. 

Agarwal and Roth [1] used Winnow as the underlying learning algorithm for 
the recognition of cars from side views. For this purpose images were represented 
as binary feature vectors. The bits of such a feature vector can be seen as the 
outcomes of weak classifiers, one weak classifier for each position in the binary 
vector. Thus for learning it is required that the outcomes of all weak classifiers 
are calculated a priori. In contrast, Boosting only needs to find the few weak 
classifiers which actually appear in the final classifier. This substantially speeds 
up learning, if the space of weak classifiers carries a structure which allows the 
efficient search for discriminative weak classifiers. A simple example is a weak 
classifier which compares a real valued feature against a threshold. For Winnow, 
one weak classifier needs to be calculated for each possible threshold a priori$^1$,
whereas for Boosting the optimal threshold can be determined efficiently when 
needed. 

A different approach to object class recognition was presented by Fergus, 
Perona, and Zisserman [3]. They used a generative probabilistic model for ob- 
jects built as constellations of parts. Using an EM-type learning algorithm they 
achieved very good recognition performance. In our work we have chosen a 
model-free approach for flexibility. If at all, the sets of weak classifiers we use can 
be seen as model classes, but with much less structure than in [3]. Furthermore, 
we propose Boosting as a very different learning algorithm from EM. 

Dorko and Schmid [2] introduced an approach for constructing and selecting 
scale-invariant object parts. These parts are subsequently used to learn a classi- 
fier. They show a robust detection under scale changes and variations in viewing 
conditions, but in contrast to our approach, the objects of interest are manu- 
ally pre-segmented. This dramatically reduces the complexity of distinguishing 
between relevant patches on the objects and background clutter. 

1 More efficient techniques for Winnow like using virtual threshold gates [13] do not
improve the situation much.

2 Our Learning Model for Object Recognition

In our setup, a learning algorithm has to recognize objects from a certain
category in still images. For this purpose, the learning algorithm delivers a
classifier that predicts whether a given image contains an object from this
category or not. As training data, labeled images
$(I_1, \ell_1), \ldots, (I_m, \ell_m)$ are provided for the learning algorithm,
where $\ell_k = +1$ if $I_k$ contains a relevant object and $\ell_k = -1$ if
$I_k$ contains no relevant object. Now the learning algorithm delivers a
function $H : I \mapsto \ell$ which predicts the label of image $I$. To
calculate this classification function $H$ we use the classical AdaBoost
algorithm [5]. AdaBoost puts weights $w_k$ on the training images and requires
the construction of a weak hypothesis $h$ which has some discriminative power
relative to these weights, i.e.

$$\sum_{k:\, h(I_k) = \ell_k} w_k \;>\; \sum_{k:\, h(I_k) \neq \ell_k} w_k, \qquad (1)$$

such that more images are correctly classified than misclassified, relative to
the weights $w_k$. (Such a hypothesis is called weak since it needs to satisfy
only a very weak requirement.) The process of putting weights and constructing a
weak hypothesis is iterated for several rounds $t = 1, \ldots, T$, and the weak
hypotheses $h_t$ of each round are combined into the final hypothesis $H$.
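
The weak-hypothesis requirement, namely that the weight of the correctly classified images exceeds the weight of the misclassified ones, can be checked in a few lines. The sketch below is illustrative only; the function names are hypothetical, and labels and predictions are assumed to take the values +1 and -1.

```python
def weighted_edge(predictions, labels, weights):
    """Weight of correctly classified images minus weight of misclassified ones."""
    correct = sum(w for p, l, w in zip(predictions, labels, weights) if p == l)
    wrong = sum(w for p, l, w in zip(predictions, labels, weights) if p != l)
    return correct - wrong

def is_weak_hypothesis(predictions, labels, weights):
    """True if the hypothesis has some discriminative power relative to the weights."""
    return weighted_edge(predictions, labels, weights) > 0

labels = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]  # one positive image is missed
print(is_weak_hypothesis(predictions, labels, [0.25, 0.25, 0.25, 0.25]))
print(is_weak_hypothesis(predictions, labels, [0.1, 0.7, 0.1, 0.1]))
```

The same predictions pass the test under uniform weights but fail once most of the weight sits on the misclassified image, which is why the hypothesis has to be chosen anew in every round.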

In each round $t$ the weight $w_k$ is decreased if the prediction for $I_k$ was
correct ($h_t(I_k) = \ell_k$), and increased if the prediction was incorrect. In
contrast to the standard AdaBoost algorithm, we vary the factor $\beta_t$ to
trade off precision and recall. We set

$$\beta_t = \begin{cases} \eta \sqrt{\dfrac{1-\varepsilon}{\varepsilon}} & \text{if } \ell_k = +1 \text{ and } \ell_k \neq h_t(I_k), \\[2ex] \sqrt{\dfrac{1-\varepsilon}{\varepsilon}} & \text{else,} \end{cases}$$

with $\varepsilon$ being the error of the weak hypothesis in this round and
$\eta$ as an additional weight factor to control the update of wrongly
classified positive examples.
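
A single reweighting round with an extra factor for missed positives might look like the sketch below. It rests on assumptions that are not spelled out in the text: the base factor is taken to be sqrt((1 - eps) / eps) as in standard AdaBoost with +1/-1 labels, correct images have their weight divided by it, misclassified images have their weight multiplied by it, and misclassified positives receive the additional factor eta. The function name is hypothetical.

```python
import math

def reweight(weights, labels, predictions, eta=1.0):
    """One AdaBoost-style weight update with an extra factor eta for missed positives."""
    total = sum(weights)
    # Weighted error of the weak hypothesis in this round.
    eps = sum(w for w, l, p in zip(weights, labels, predictions) if l != p) / total
    if not 0.0 < eps < 0.5:
        raise ValueError("expected a weak hypothesis with error strictly between 0 and 0.5")
    base = math.sqrt((1.0 - eps) / eps)
    updated = []
    for w, l, p in zip(weights, labels, predictions):
        if l == p:
            updated.append(w / base)        # correct prediction: decrease the weight
        elif l == +1:
            updated.append(w * eta * base)  # missed positive: increase, scaled by eta
        else:
            updated.append(w * base)        # false positive: increase
    norm = sum(updated)
    return [w / norm for w in updated]

labels = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]
standard = reweight([0.25] * 4, labels, predictions)          # eta = 1
boosted = reweight([0.25] * 4, labels, predictions, eta=2.0)  # emphasise missed positives
print(standard, boosted)
```

With eta = 1 the misclassified images end up with half of the total weight, as in standard AdaBoost. A larger eta shifts more weight to missed positives, which is one way to read the stated trade-off between precision and recall.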

Here two general comments are in order. First, it is intuitively quite clear
that weak hypotheses with high discriminative power, i.e. with a large
difference of the sums in (1), are preferable, and indeed this is shown in the
convergence proof of AdaBoost [5]. Second, the adaptation of the weights $w_k$
in each round performs some sort of adaptive decorrelation of the weak
hypotheses: if an image was correctly classified in round $t$, then its weight
is decreased and less emphasis is put on this image in the next round, yielding
quite different hypotheses $h_t$ and $h_{t+1}$.$^2$ Thus it can be expected that
the first few weak hypotheses characterize the object category under
consideration quite well. This is particularly interesting when a sparse
representation of the object category is needed.

Obviously AdaBoost is a very general learning technique for obtaining classi- 
fication functions. To adapt it for a specific application, suitable weak hypotheses 

2 In fact AdaBoost sets the weights in such a way that $h_t$ is not
discriminative with respect to the new weights. Thus $h_t$ is in some sense
oblivious to the predictions of $h_{t+1}$.

have to be constructed. For the purpose of object recognition we need to extract
suitable features from images and use these features to construct the weak
hypotheses. Since AdaBoost is a general learning technique we are free to choose
any type of features we like, as long as we are able to provide an effective
weak hypotheses finder which returns discriminative weak hypotheses based on
this set of features. The chosen features should be able to represent the
content of images, at least with respect to the object category under
consideration. Since we may choose several types of features, we represent an
image $I$ by a set of pairs $\mathcal{R}(I) = \{(\tau, v)\}$ where $\tau$
denotes the type of a feature and $v$ denotes a value of this feature, typically
a vector of reals. Then for AdaBoost a weak hypothesis is constructed from the
representations $\mathcal{R}(I_k)$, labels $\ell_k$, and weights $w_k$ of the
training images.

In the next section we describe the types of features we are currently using, 
although many other features could be used, too. In Section 4 we describe the 
effective construction of the weak hypotheses. 



3 Image Analysis and Feature Construction 

We extract features from raw images, ignoring the labels used for learning. To
reduce the number of points in an image that we have to attend to, we use an
interest point detector to get salient points. We evaluate three different
detectors: a scale invariant interest point detector, an affine invariant
interest point detector, and the SIFT interest point detector ([15], [16], [12],
see section 3.1). Using these salient points we can reduce the content of an
image to a number of points (and their surroundings) while being robust against
irrelevant variations in illumination and scale. Since the most salient
points$^3$ may not belong to the relevant objects, we have to take a rather
large number of points into account, which implies choosing a low threshold in
the interest point detectors. The number of SIFTs is reduced by a vector
quantization using k-means (similarly to Fergus et al. [3]). The pixels
enclosing an interest point are referred to as a patch. Due to different
illumination conditions we normalise each patch before the local descriptors are
calculated. Representing patches through a local descriptor can be done in
different ways. We use subsampled grayvalues, intensity moments, Moment
Invariants and SIFTs here.



3.1 Interest Point Detection 

There is a variety of work on interest point detection at fixed (e.g. [9,21,25,10]), 
and at varying scales (e.g. [11,15,16]). Based on the evaluation of interest point 
detectors by Schmid et al. [19], we decided to use the scale invariant Harris- 
Laplace detector [15] and the affine invariant interest point detector [16], both 
by Mikolajczyk and Schmid. In addition we use the interest point detector used 
by Lowe [12] because it is strongly interrelated with SIFTs as local descriptors. 

3 E.g. by measuring the entropy of the histogram in the surrounding [3] or doing a
Principal Component Analysis.

The scale invariant detector finds interest points by calculating a scaled
version of the second moment matrix $M$ and localizing points where the Harris
measure $H = \det(M) - \alpha\,\mathrm{trace}^2(M)$ is above a certain threshold
$th$. The characteristic scale for each of these points is found in scale-space
by calculating the Laplacians
$L(x, \sigma) = |\sigma^2 (L_{xx}(x, \sigma) + L_{yy}(x, \sigma))|$ for each
desired scale $\sigma$ and taking the one at which $L$ has a maximum in an
8-neighbourhood of the point.
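
The Harris measure and the choice of a characteristic scale can be illustrated with the short sketch below. It is a simplification under stated assumptions: the second moment matrix is passed in as its three distinct entries, the value of alpha is only a placeholder (the text does not give it), and the scale is selected at a single point rather than over an 8-neighbourhood. All names are hypothetical.

```python
def harris_measure(m_xx, m_xy, m_yy, alpha=0.06):
    """H = det(M) - alpha * trace(M)^2 for M = [[m_xx, m_xy], [m_xy, m_yy]]."""
    det = m_xx * m_yy - m_xy * m_xy
    trace = m_xx + m_yy
    return det - alpha * trace * trace

def characteristic_scale(laplacian_by_scale):
    """Pick the scale sigma at which |sigma^2 * (L_xx + L_yy)| is largest.

    laplacian_by_scale maps sigma to the unnormalised value L_xx + L_yy at one point.
    """
    return max(laplacian_by_scale, key=lambda s: abs(s * s * laplacian_by_scale[s]))

print(harris_measure(10.0, 0.0, 10.0))  # strong gradients in both directions (corner-like)
print(harris_measure(10.0, 0.0, 0.0))   # gradient in one direction only (edge-like)
print(characteristic_scale({1.0: 0.5, 2.0: 0.3, 4.0: 0.05}))
```

A corner-like matrix gives a positive measure and an edge-like one a negative measure, so thresholding H keeps the former.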

The affine invariant detector is also based on the second moment matrix 
computed at a point which can be used to normalise a region in an affine invariant 
way. The characteristic scale is again obtained by selecting the scale at which the 
Laplacian has a maximum. An iterative algorithm is then used which converges 
to affine invariant points by modifying the location, scale and neighbourhood of 
each point. 

Lowe introduced an interest point detector invariant to translation, scaling
and rotation and minimally affected by small distortions and noise [12]. He also
uses the scale-space, but built with a difference of Gaussians (DoG).
Additionally, a scale pyramid obtained by bilinear interpolation is employed. By
calculating the image gradient magnitude and the orientation at each point of
the scale pyramid, salient points with characteristic scales and orientations
are obtained.

3.2 Region Normalisation 

To normalise the patches we have to consider illumination, scale and affine
transformations. For the size normalisation we have decided to use square
patches with a side of $l$ pixels. The value of $l$ is a parameter we vary in
our experiments. We extract a window of size $w = 6 \cdot \sigma_I$ where
$\sigma_I$ is the characteristic scale of the interest point delivered by the
interest point detector. Scale normalisation is done by smoothing and
subsampling in cases of $l < w$ and by linear interpolation otherwise. In order
to obtain affine invariant patches the values of the transformation matrix
resulting from the affine invariant interest point detector are used to
normalise the window to the shape of a square, before the size normalisation.

For illumination normalisation we use Homomorphic Filtering (see e.g. [7], 
chapter 4.5). The Homomorphic Filter is based on an image formation model 
where the image intensity I(x,y) = i(x, y)r(x,y) is modeled as the product of 
illumination i(x,y) and reflectance r(x, y). Elimination of the illumination part 
leads to a normalisation. This is achieved by applying a Fast Fourier Transform 
to the logarithm image $\ln(I)$. Now the reflectance component can be separated
by a high pass filter. After a back transformation and an exponentiation we get 
the desired normalised patch. 
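
The homomorphic filtering steps (logarithm, Fourier transform, high-pass filter, back transformation, exponentiation) can be sketched for a tiny patch as follows. This is a minimal illustration, not the filter used in the paper: it uses a naive discrete Fourier transform instead of an FFT, an ideal high-pass filter with a hypothetical cutoff parameter, and it assumes a square patch with strictly positive intensities.

```python
import cmath
import math

def dft2(a, inverse=False):
    """Naive 2-D discrete Fourier transform of a square list of lists (tiny patches only)."""
    n = len(a)
    sign = 1 if inverse else -1
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0j
            for x in range(n):
                for y in range(n):
                    s += a[x][y] * cmath.exp(sign * 2j * math.pi * (u * x + v * y) / n)
            out[u][v] = s / (n * n) if inverse else s
    return out

def homomorphic_filter(patch, cutoff=1.0):
    """Suppress low frequencies of log-intensity, i.e. the slowly varying illumination."""
    n = len(patch)
    # The logarithm turns the product illumination * reflectance into a sum.
    log_img = [[math.log(p) for p in row] for row in patch]
    spectrum = dft2(log_img)
    for u in range(n):
        for v in range(n):
            fu, fv = min(u, n - u), min(v, n - v)  # wrap-around frequency indices
            if math.hypot(fu, fv) <= cutoff:
                spectrum[u][v] = 0j  # ideal high-pass: drop the low frequencies
    filtered = dft2(spectrum, inverse=True)
    return [[math.exp(c.real) for c in row] for row in filtered]

patch = [[10, 12, 14, 16], [11, 30, 32, 17], [12, 31, 33, 18], [13, 15, 17, 19]]
brighter = [[3 * p for p in row] for row in patch]  # same scene, three times the light
normalised = homomorphic_filter(patch)
normalised_brighter = homomorphic_filter(brighter)
```

Because a global change of illumination only adds a constant to the logarithm image, and that constant lives entirely in the zero-frequency coefficient, both patches above are mapped to the same normalised result.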

3.3 Feature Extraction 

To represent each patch we have to choose some local descriptors. Local
descriptors have been researched quite well (e.g. [4], [12], [18], [8]). We
selected four local descriptors for our patches. Our first descriptor is simply
a vector of all pixels in a patch subsampled by two. The dimension of this
vector is $l^2/4$, which is rather high and increases computational complexity.
As a second descriptor we use intensity moments
$MI^{a}_{pq} = \iint_{\omega} i(x,y)^a \, x^p y^q \, dx \, dy$ with $a$ as the
degree and $p + q$ as the order, up to degree 2 and order 2. Without using the
moments of degree 0 we get a feature vector with a dimension of 10. This reduces
the computational costs dramatically. With respect to the performance evaluation
of local descriptors done by Mikolajczyk and Schmid [17] we took SIFTs (see
[12]) as a third and Moment Invariants (see [8]) as a fourth choice. In this
evaluation the SIFTs outperformed the others in nearly all tests and the Moment
Invariants were in the middle ground for all aspects considered.

According to [8] we selected first and second order Moment Invariants. We 
chose the first order affine Invariant and four first order affine and photometric 
Invariants. Additionally we took all five second order Invariants described in [8]. 
Since the Invariants require two contours, the whole square patch is taken as one 
contour and rectangles corresponding to one half of the patch are used as a second 
contour. All four possibilities of the second contour are calculated and used to 
obtain the Invariants. The dimension of the Moment Invariants description vector
is 10.

As shown in [12] the description of the patches with SIFTs is done by mul- 
tiple representations in various orientation planes. These orientation planes are 
blurred and resampled to allow larger shifts in positions of the gradients. A local 
descriptor with a dimension of 128 is obtained here for a circular region around 
the point with a radius of 8 pixels, 8 orientation planes and sampling over a 4x4 
and a 2x2 grid of locations. 



4 Calculation of Weak Hypotheses 



Using the features constructed in the previous section, an image is represented
by a list of features $(\tau_f, v_f)$, $f = 1, \ldots, F$, where $\tau_f$
denotes the type of a feature, $v_f$ denotes its value as a real vector, and $F$
is the number of extracted features in an image. The weak hypotheses for
AdaBoost are calculated from these features. For object recognition we have
chosen weak hypotheses which indicate if certain feature values appear in
images. For this a weak hypothesis $h$ has to select a feature type $\tau$, its
value $v$, and a similarity threshold $\theta$. The threshold $\theta$ decides
if an image contains a feature value $v_f$ that is sufficiently similar to $v$.
The similarity between $v_f$ and $v$ is calculated by the Mahalanobis distance
for Moment Invariants and by the Euclidean distance for SIFTs. The weak
hypotheses finder searches for the optimal weak hypothesis, given labeled
representations of the training images
$(\mathcal{R}(I_1), \ell_1), \ldots, (\mathcal{R}(I_m), \ell_m)$ and their
weights $w_1, \ldots, w_m$ calculated by AdaBoost, among all possible feature
values and corresponding thresholds.

The main computational burden is the calculation of the distances between
$v_f$ and $v$, since they both range over all feature values that appear in the
training images.$^4$ Given these distances, which can be calculated prior to
Boosting, the remaining calculations are relatively inexpensive. Details for the
weak hypotheses finder are given in Figure 2. After sorting, the optimal
threshold for feature $(\tau_{k,f}, v_{k,f})$ can now be calculated in time
$O(m)$ by scanning through the weights

4 We discuss possible improvements in Section 6.

Input: Labeled representations $(\mathcal{R}(I_k), \ell_k)$, $k = 1, \ldots, m$,
$\mathcal{R}(I_k) = \{(\tau_{k,f}, v_{k,f}) : f = 1, \ldots, F_k\}$.

Distance functions: Let $d_\tau(\cdot, \cdot)$ be the distance with respect to
the feature values of type $\tau$ in the training images.

Minimal distance matrix: For all features $(\tau_{k,f}, v_{k,f})$ and all images
$I_j$ calculate the minimal distance between $v_{k,f}$ and features in $I_j$,

$$d_{k,f,j} = \min_{1 \le g \le F_j :\, \tau_{j,g} = \tau_{k,f}} d_{\tau_{k,f}}(v_{k,f}, v_{j,g}).$$

Sorting: For each $k, f$ let $\pi_{k,f}(1), \ldots, \pi_{k,f}(m)$ be a
permutation such that

$$d_{k,f,\pi_{k,f}(1)} \le \cdots \le d_{k,f,\pi_{k,f}(m)}.$$

Select best weak hypothesis (Scanline): For all features
$(\tau_{k,f}, v_{k,f})$ calculate over all images $I_j$

$$\max_{s} \sum_{i=1}^{s} w_{\pi_{k,f}(i)} \, \ell_{\pi_{k,f}(i)}$$

and select the feature $(\tau_{k,f}, v_{k,f})$ where the maximum is achieved.

Select threshold $\theta$: With the position $s$ where the scanline reached a
maximum sum, the threshold $\theta$ is set to

$$\theta = \frac{d_{k,f,\pi_{k,f}(s)} + d_{k,f,\pi_{k,f}(s+1)}}{2}.$$

Fig. 2. Explanation of the weak hypotheses finder. 



$w_1, \ldots, w_m$ in the order of the distances $d_{k,f,j}$. Searching over all
features, the calculation of the optimal weak hypothesis takes $O(Fm)$ time.
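
The weak hypotheses finder (minimal distance matrix, sorting, scanline, threshold selection) can be sketched as below. This is an illustrative reading of the procedure, not the authors' implementation: distances are recomputed and sorted inside the loop instead of being precomputed before Boosting, the Euclidean distance is used for every feature type, the threshold is taken as the midpoint between the distances at positions s and s+1, and ties between equal distances are ignored. The data layout and function names are assumptions.

```python
import math

def find_weak_hypothesis(images, labels, weights, distance=math.dist):
    """images[k] is a list of (feature_type, value_vector) pairs; labels are +1 / -1.

    Returns (score, feature_type, value, threshold) of the best weak hypothesis.
    """
    m = len(images)
    best = None
    for k in range(m):
        for ftype, v in images[k]:
            # Minimal distance from v to any feature of the same type in each image I_j.
            dists = []
            for j in range(m):
                same = [distance(v, u) for t, u in images[j] if t == ftype]
                dists.append(min(same) if same else math.inf)
            order = sorted(range(m), key=lambda j: dists[j])
            # Scanline: running sum of w * l in order of increasing distance.
            running, top, top_s = 0.0, -math.inf, 0
            for s, j in enumerate(order):
                running += weights[j] * labels[j]
                if running > top:
                    top, top_s = running, s
            if best is None or top > best[0]:
                here = dists[order[top_s]]
                nxt = dists[order[top_s + 1]] if top_s + 1 < m else math.inf
                # Midpoint between the last accepted and the first rejected distance.
                theta = (here + nxt) / 2 if nxt < math.inf else here
                best = (top, ftype, v, theta)
    return best

def predict(image, ftype, v, theta, distance=math.dist):
    """+1 if the image contains a feature of this type within distance theta of v."""
    return +1 if any(t == ftype and distance(v, u) <= theta for t, u in image) else -1

# Hypothetical toy data: both positive images contain a feature near the origin.
images = [
    [("sift", (0.0, 0.0)), ("sift", (5.0, 5.0))],
    [("sift", (0.1, 0.0)), ("sift", (9.0, 1.0))],
    [("sift", (5.0, 5.2))],
    [("sift", (8.0, 1.0))],
]
labels = [+1, +1, -1, -1]
score, ftype, value, theta = find_weak_hypothesis(images, labels, [0.25] * 4)
predictions = [predict(img, ftype, value, theta) for img in images]
```

On this toy set the finder selects the feature at the origin with a threshold between the positive and negative images, so the resulting hypothesis classifies all four images correctly.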

To give an example of the absolute computation times we used a dataset 
of 150 positive and 150 negative images. Each image has an average number 
of approximately 400 patches. Using SIFTs one iteration after preprocessing 
requires about one minute computation time on a P4, 2.4GHz PC. 



5 Experimental Setup and Results 

We carried out our experiments as follows: the whole approach was first tested 
on the database used by Fergus et al. [3]. After demonstrating a comparable 
performance, the approach was tested on a new, more difficult database$^5$, see
figure 5. These images contain the objects at arbitrary scales and poses. The 
images also contain highly textured background. Testing on these images shows 
that our approach still performs well. We have used two categories of objects, 
persons (P) and bikes (B), and images containing none of these objects (N). Our 
database contains 450 images of category P, 350 of B and 250 of category N. 
The recognition was based on deciding presence or absence of a relevant object. 

5 Available at http://www.emt.tugraz.at/~pinz/data/

To prepare our data set, we randomly chose a number of images, half belonging
to the object category we want to learn and half not. From each of these two
sets we take one third of the images as a test set for the learned model. The
performance was measured with the receiver operating characteristic (ROC) equal
error rate. We tested the images containing the object (e.g. category B) against
non-object images from the database (e.g. categories P and N). Our training set
contains 100 positive and 100 negative images. The tests are carried out on 100
new images, half belonging to the learned class and half not. Each experiment
was done using just one type of local descriptor.
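
An equal error rate can be read off a set of classifier scores as sketched below. This is a generic approximation, not the evaluation code of the paper: it sweeps the observed scores as thresholds and reports the average of the false positive and false negative rates at the threshold where the two are closest. The function name and the example scores are hypothetical.

```python
def roc_equal_error_rate(scores, labels):
    """Approximate the error rate where false positive and false negative rates meet."""
    positives = sum(1 for l in labels if l == +1)
    negatives = len(labels) - positives
    best_gap, best_rate = None, None
    for threshold in sorted(set(scores)):
        # An image is predicted positive if its score reaches the threshold.
        fn = sum(1 for s, l in zip(scores, labels) if l == +1 and s < threshold)
        fp = sum(1 for s, l in zip(scores, labels) if l == -1 and s >= threshold)
        fnr, fpr = fn / positives, fp / negatives
        gap = abs(fnr - fpr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fnr + fpr) / 2
    return best_rate

scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.05]
labels = [+1, +1, +1, +1, -1, -1, -1, -1]
eer = roc_equal_error_rate(scores, labels)
separable = roc_equal_error_rate([0.9, 0.8, 0.2, 0.1], [+1, +1, -1, -1])
```

In the first example one positive and one negative image are misplaced at the balancing threshold, giving an equal error rate of 0.25; the perfectly separable example gives 0.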

Figure 3(a) shows the recall-precision curve (RPC) of our approach (obtained
by varying $\eta$), the approach of Fergus et al. [3] and the one of Agarwal and
Roth [1], trained on the dataset used by Fergus et al. [3]$^6$. Our approach
performs better than the one of Agarwal and Roth but slightly worse than the
approach of Fergus et al.

Table 1 shows the results of our approach (using the affine invariant interest 
point detection and Moment Invariants) compared with the ones of Fergus et 
al. and other methods [23], [24], [1]. While they use a kind of scale and viewing 
direction normalisation (see [3]), we work on the original images. Our results are 
almost as good as the results of Fergus et al. for the motorbikes dataset. For the 
other datasets our error rate is somewhat higher than the one of Fergus et al., 
but mostly lower than the error rate of the other methods. 



Table 1. The table gives the ROC equal error rates on a number of datasets from
the database used by Fergus et al. [3]. Our results (using the affine invariant interest
point detection and Moment Invariants) are compared with the results of the approach
of Fergus et al. and other methods [23], [24], [1]. The error rates of our algorithm are
between the other approaches and the ones of Fergus et al. in all cases except for the
faces where the algorithm of Weber et al. [24] is also slightly better.

Dataset    | Ours | Fergus et al. [3] | Others | Ref.
Motorbikes | 92.2 | 92.5              | 84     | [23]
Airplanes  | 88.9 | 90.2              | 68     | [23]
Faces      | 93.5 | 96.4              | 94     | [24]
Cars(Side) | 83.0 | 88.5              | 79     | [1]



This comparison shows that our approach performs well on the Fergus et
al. database. We proceed with experiments on our own dataset and show some
effects of parameter tuning$^7$. Figure 3(b) shows the influence of the additional
weighting of right positive examples in the Boosting algorithm ($\eta$). We can see
that with a factor $\eta$ smaller than 1.8, the recall increases faster than the precision

6 Available at http : // www.robots.ox.ac.uk/ ~ vgg/data/ 

7 Parameters not given in these tests are set to η = 1.8, T = 50, l = 16 px, th = 
30000; the smallest scale is skipped. Depending on whether the background is textured or homogeneous, 
the number of interest points detected in an image varies between 50 and 1000. 



Fig. 3. The curves in (a) and (b) are obtained by varying the factor η. In (a) the 
diagram shows the recall-precision curve for [3], [1] and our approach on the cars (side) 
dataset. Our approach is superior to the one of Agarwal and Roth but slightly worse 
than the one of Fergus et al. The diagram (b) shows the influence of an additional 
factor η for the weights of correctly classified positive examples. The recall increases 
faster than the precision drops until a factor of 1.8. 



drops. Then both curves have nearly the same (but inverse) gradient up to a 
factor of 3. For η > 3 the precision decreases rapidly with no relevant gain in 
recall. 

Table 2 presents the performance of the Moment Invariants as a local descriptor, 
compared with our low level descriptors (using the affine invariant interest point 
detector). Moment Invariants delivered the best results but the other low level 
descriptors did not perform badly, either. This behaviour might be explained by 
the fact that the extracted regions are already normalised against the same set 
of transformations as the Moment Invariants. 



Table 2. The table shows the results we reached with the three different kinds of 
local descriptors. We used an additional weight factor η = 1.7 here. Moment Invariants 
delivered the best results. 



Local Descriptor      | recall | precision
Moment Invariants     | 0.88   | 0.61
Intensity Moments     | 0.70   | 0.57
Subsampled Grayvalues | 0.82   | 0.62



Table 3 compares the results of our approach using the scale invariant interest point 
detector with those obtained using the affine invariant interest point detector. 
We also vary the additional weight η for correctly classified positive examples. 
The affine invariant interest point detector achieves better recall, 
but precision is higher when we use the scale invariant version of the interest 



Fig. 4. (a) Recall-precision curves for the category bike, obtained by our approach with 
Moment Invariants and the affine invariant interest point detection, and by our approach 
using SIFTs. (b) Recall-precision curves obtained with the same methods for the 
category person. 



point detector. This is to be expected since the affine invariant detector allows 
for more variation in the image, which implies higher recall but less precision. 



Table 3. The table shows the results of our approach using the scale invariant interest 
point detector compared with using the affine invariant interest point detector, while varying 
the additional weight η for correctly classified positive examples. 



η   | recall (scale inv.) | precision (scale inv.) | recall (affine inv.) | precision (affine inv.)
1.7 | 0.78                | 0.70                   | 0.88                 | 0.61
1.9 | 0.79                | 0.64                   | 0.92                 | 0.59
2.1 | 0.82                | 0.62                   | 0.94                 | 0.57



We skipped the smallest scale in our experiments because experiments show 
that this reduction in the number of points has no relevant influence on the 
error rates. Again, using the parameters that performed best, figure 4(a) shows 
an example of a recall-precision curve (RPC) of our approach trained on the 
bike dataset from our image database with Moment Invariants and the affine 
invariant interest point detection compared with our approach using SIFTs. 
Using the same methods we obtain the recall-precision curves (RPC) shown in 
figure 4(b) for the category person. 

For directly comparing the results reached using the Moment Invariants with 
the affine invariant interest point detector or using SIFTs, the ROC equal error 
rates on various datasets are shown in table 4. As seen here the SIFTs perform 
better on our database. Tested on a category of the database from Fergus et 
al. one can see that the Moment Invariants perform better in that case. 
Table 4. This table shows a comparison of the ROC equal error rates reached with 
the two high level features. On our database the SIFTs perform better, but on the 
database of Fergus et al. the Moment Invariants reach the better error rate. 



Dataset   | Moment Invariants | SIFTs
Airplanes | 88.9              | 80.5
Bikes     | 76.5              | 86.5
Persons   | 68.7              | 80.8



6 Discussion and Outlook 

In conclusion, we have presented a novel approach for the detection and recog- 
nition of object categories in still images. Our system uses several steps of image 
analysis and feature extraction, which have been previously described, but suc- 
ceeds on rather complex images with a lot of background structure. Objects are 
shown in substantially different poses and scales, and in many of the images the 
objects (bikes or persons) cover only a small portion of the whole image. The 
main contribution of the paper, however, lies in the new concept of learning. 
We use Boosting as the underlying learning technique and combine it with a 
weak hypothesis finder. In addition to several other advantages of this approach, 
which have already been mentioned, we want to emphasize that this approach 
allows the formation of very diverse visual features into a final hypothesis. We 
think that this capability is the main reason for the good experimental results on 
our complex database. Furthermore, experimental comparison on the database 
used by Fergus et al. [3] shows that our approach performs similarly well to 
state-of-the-art object categorization on simpler images. 

We are currently investigating extensions of our approach in several direc- 
tions. Maybe the most obvious is the addition of more features to our image 
analysis. This includes not only other local descriptors like differential invari- 
ants [12], but also regional features 8 and geometric features 9 . To reduce the 
complexity of our approach we are considering a reduction of the number of 
features by clustering methods. 

As the next step we will use spatial relations between features to improve 
the accuracy of our object detector. To handle the complexity of many possible 
relations between features, we will use the features constructed in our current 
approach (with parameters set for high recall) as starting points. Boosting will 
again be the underlying method for learning object representations as spatial 
combinations of features. This will allow the construction of weak hypotheses 
for discriminative spatial relations. 



Acknowledgements. This work was supported by the European project LAVA 
(IST-2001-34405) and by the Austrian Science Foundation (FWF, project S9103-N04). 
We are grateful to David Lowe and Cordelia Schmid for providing the code 
for their detectors/descriptors. 

8 Regional features describe regions found by appearance based clustering. 

9 A geometric feature describes the appearance of geometric shapes, e.g. ellipses, in 
images. 



Fig. 5. Examples from our image database. The first column shows three images from 
the object class bike, the second column contains objects from the class person and the 
images in the last column belong to none of the classes (called nobikenoperson). The 
second example in the last column shows a moped as a very difficult counter-example 
to the category of bikes. 

Abstract. We describe a method for automatically associating image patches 
from frames of a movie shot into object-level groups. The method employs both 
the appearance and motion of the patches. 

There are two areas of innovation: first, affine invariant regions are used to repair 
short gaps in individual tracks and also to join sets of tracks across occlusions 
(where many tracks are lost simultaneously); second, a robust affine factoriza- 
tion method is developed which is able to cope with motion degeneracy. This 
factorization is used to associate tracks into object-level groups. 

The outcome is that separate parts of an object that are never visible simultaneously 
in a single frame are associated together. For example, the front and back of a car, 
or the front and side of a face. In turn this enables object-level matching and 
recognition throughout a video. 

We illustrate the method for a number of shots from the feature film ‘Groundhog 
Day’. 



1 Introduction 

The objective of this work is to automatically extract and group independently moving 
3D semi-rigid (that is, rigid or slowly deforming) objects from video shots. The principal 
reason we are interested in this is that we wish to be able to match such objects throughout 
a video or feature length film. An object, such as a vehicle, may be seen from one aspect 
in a particular shot (e.g. the side of the vehicle) and from a different aspect (e.g. the 
front) in another shot. Our aim is to learn multi-aspect object models [19] from shots 
which cover several visual aspects, and thereby enable object level matching. 

In a video or film shot the object of interest is usually tracked by the camera — 
think of a car being driven down a road, and the camera panning to follow it, or tracking 
with it. The fact that the camera motion follows the object motion has several beneficial 
effects for us: the background changes systematically, and may often be motion blurred 
(and so features are not detected there); and, the regions of the object are present in the 
frames of the shot for longer than other regions. Consequently, object level grouping can 
be achieved by determining the regions that are most common throughout the shot. 

In more detail we define object level grouping as determining the set of appearance 
patches which (a) last for a significant number of frames, and (b) move (semi-rigidly) 
together throughout the shot. In particular (a) requires that every appearance of a patch is 
identified and linked, which in turn requires extended tracks for a patch - even associating 
patches across partial and complete occlusions. Such thoroughness has two benefits: first, 
the number of frames in which a patch appears really does correspond to the time that 
it is visible in the shot, and so is a measure of its importance. Second, developing very 
long tracks significantly reduces the degeneracy problems which plague structure and 
motion estimation [5]. 

The innovation here is to use both motion and appearance consistency throughout the 
shot in order to group objects. The technology we employ to obtain appearance patches 
is that of affine co-variant regions [9,10,11,18]. These regions deform with viewpoint so 
that their pre-image corresponds to the same surface patch. 

To achieve object level grouping we have advanced the state of the art in two areas: 
first, the affine invariant tracked regions are used to repair short gaps in tracks (section 3) 
and also associate tracks when the object is partially or totally occluded for a period 
(section 5). The result is that regions are matched throughout the shot whenever they 
appear. Second, we develop a method of robust affine factorization (section 4) which is 
able to handle degenerate motions [17] in addition to the usual problems of missing and 
mis-matched points [1,3,7,13]. 

The task we carry out differs from that of layer extraction [16], or dominant motion 
detection where generally 2D planes are extracted, though we build on these approaches. 
Here the object may be 3D, and we pay attention to this, and also it may not always be 
the foreground layer as it can be partially or totally occluded for part of the sequence. 

In section 6 we demonstrate that the automatically recovered object groupings are 
sufficient to support object level matching throughout the feature film ‘Groundhog Day’ 
[Ramis, 1993]. This naturally extends the frame based matching of ‘Video Google’ [14]. 



2 Basic Segmentation and Tracking 



Affine invariant regions. Two types of affine invariant region detector are used: one 
based on interest point neighborhoods [10,11], the other based on the “Maximally Stable 
Extremal Regions” (MSER) approach of Matas et al. [9]. In both the detected region is 
represented by an ellipse. Implementation details of these two methods are given in the 
citations. 

It is beneficial to have more than one type of region detector because in some imaged 
locations a particular type of feature may not occur at all. Here we have the benefit of 
region detectors firing at points where there is signal variation in more than one direction 
(e.g. near “blobs” or “corners”), as well as at high contrast extended regions. These two 
image areas are quite complementary. The union of both provides a good coverage of 
the image provided it is at least lightly textured, as can be seen in figure 1. The number 
of regions and coverage depends of course on the visual richness of the image. 

To obtain tracks throughout a shot, regions are first detected independently in each 
frame. The tracking then proceeds sequentially, looking at only two consecutive frames 
at a time. The objective is to obtain correct matches between the frames which can 
then be extended to multi-frame tracks. It is here that we benefit significantly from the 
affine invariant regions: first, incorrect matches can be removed by requiring consistency 
with multiple view geometric relations: the robust estimation of these relations for point 
matches is very mature [6] and can be applied to the region centroids; second, the regions 
can be matched on their appearance. The latter is far more discriminating and invariant 
than the usual cross-correlation over a square window used in interest point trackers. 



Fig. 1. Example of affine invariant region detection. (a) Frame number 8226 from ‘Groundhog 
Day’. (b) Ellipses formed from 722 affine invariant interest points. (c) Ellipses formed from 1269 
MSER regions. Note the sheer number of regions detected just in a single frame, and also that the 
two types of region detectors fire at different and complementary image locations. 



Tracker implementation. In a pair of consecutive frames, detected regions in the first 
frame are putatively matched with all detected regions in the second frame, within a 
generous disparity threshold of 50 pixels. Many of these putative matches will be wrong 
and an intensity correlation computed over the area of the elliptical region removes 
all putative matches with a normalized cross correlation below 0.90. The 1-parameter 
(rotation) ambiguity between regions is assumed to be close to zero, because there will 
be little cyclo-torsion between consecutive frames. All matches that are ambiguous, i.e. 
those that putatively match several features in the other frame, are eliminated. 
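The matching stage just described can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes each region is given as a centroid plus a list of intensities already resampled from its ellipse to a common size (the resampling itself is not shown), and the function names are hypothetical. The 50 pixel disparity limit and the 0.90 correlation threshold are the values quoted in the text.

```python
import math
from collections import Counter


def ncc(a, b):
    """Normalized cross correlation of two equally long intensity lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0  # a constant patch carries no correlation information
    return sum(x * y for x, y in zip(da, db)) / denom


def match_regions(regions1, regions2, max_disparity=50.0, min_ncc=0.90):
    """Return unambiguous (index1, index2) matches between two frames.

    Each region is a (centroid, patch) pair, centroid = (x, y).
    """
    putative = []
    for i, (c1, p1) in enumerate(regions1):
        for j, (c2, p2) in enumerate(regions2):
            if math.dist(c1, c2) <= max_disparity and ncc(p1, p2) >= min_ncc:
                putative.append((i, j))

    # Eliminate every match whose region takes part in several matches.
    count1 = Counter(i for i, _ in putative)
    count2 = Counter(j for _, j in putative)
    return [(i, j) for i, j in putative if count1[i] == 1 and count2[j] == 1]
```

The surviving matches would then be passed to the epipolar geometry check.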

Finally epipolar geometry is fitted between the two views using RANSAC with a 
generous inlier threshold of 3 pixels. This step is very effective in removing outlying 
matches whilst not eliminating the independent motions which occur between the two 
frames. 

The results of this tracking on a shot from the movie ‘Groundhog Day’ are shown 
in figure 3b. This shot is used throughout the paper to illustrate the stages of the object 
level grouping. Note that the tracks have very few outliers. 



3 Short Range Track Repair 

The simple region tracker of the previous section can fail for a number of reasons most 
of which are common to all such feature trackers: (i) no region (feature) is detected in a 
frame - the region falls below some threshold of detection (e.g. due to motion blur); (ii) 
a region is detected but not matched due to a slightly different shape; and, (iii) partial or 
total occlusion. 

The causes (i) and (ii) can be overcome by short range track repair using motion 
and appearance, and we discuss this now. Cause (iii) can be overcome by wide baseline 
matching on motion grouped objects within one shot, and discussion of this is postponed 
until section 5. 

3.1 Track Repair by Region Propagation 

The goal of the track repair is to improve tracking performance in cases where region 
detection or the first stage tracking fails. The method will be explained for the case of a 
one frame extension; the other short range cases (2-5 frames) are analogous. 
Fig. 2. (a) Histogram of track lengths for the shot shown in figure 3 for basic tracking (section 2) 
and after short range track repair (section 3). Note the improvement in track length after the repair 
- the weight of the histogram has shifted to the right from the mode at 10. (b) The sparsity pattern 
of the tracked features in the same shot. The tracks are coloured according to the independently 
moving objects that they belong to, as described in section 4. The two gray blocks (track numbers 
1-1808 and 2546-5011) correspond to the two background objects. The red and green blocks 
(1809-2415 and 2416-2545 respectively) correspond to the van object before and after the occlusion. 



The repair algorithm works on pairs of neighboring frames and attempts to extend 
already existing tracks which terminate in the current frame. Each region which has 
been successfully tracked for more than n frames and for which the track terminates 
in the current frame is propagated to the next frame. The propagating transformation is 
estimated from a set of k spatially neighboring tracks (here n = 5 and k = 5). In the 
case of successive frames only translational motion is estimated from the neighboring 
tracks. In the case of more separated frames the full affine transformation imposed by 
each tracked region should be employed. 
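For the one frame case, the propagation step can be sketched as below. The data layout (a dictionary of tracks, each mapping a frame index to a centroid) and the function name are assumptions made for illustration, and averaging the neighbours' displacements is one plausible estimator since the text does not say how the translation is computed. The defaults n = 5 and k = 5 are the values given in the text.

```python
import math


def propagate_terminated_tracks(tracks, frame, n=5, k=5):
    """Predict next-frame positions for tracks that end in `frame`.

    tracks: dict track_id -> dict frame_index -> (x, y) centroid.
    Returns dict track_id -> predicted (x, y) in frame + 1.
    """
    # Tracks present in both frames supply the motion estimate.
    continuing = [t for t in tracks.values() if frame in t and frame + 1 in t]

    predictions = {}
    for track_id, t in tracks.items():
        if frame not in t or frame + 1 in t:
            continue  # the track does not terminate in this frame
        if len(t) <= n:
            continue  # only tracks followed for more than n frames are propagated
        x, y = t[frame]
        neighbours = sorted(
            continuing, key=lambda s: math.dist(s[frame], (x, y))
        )[:k]
        if not neighbours:
            continue
        # Translation only: mean displacement of the k nearest tracks.
        dx = sum(s[frame + 1][0] - s[frame][0] for s in neighbours) / len(neighbours)
        dy = sum(s[frame + 1][1] - s[frame][1] for s in neighbours) / len(neighbours)
        predictions[track_id] = (x + dx, y + dy)
    return predictions
```

Each predicted position would then seed the local region refinement and the correlation test.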

It must now be decided if there is a detectable region near the propagated point, and 
if it matches an existing region. The refinement algorithm of Ferrari et al. [4] is used 
to fit the region locally in the new frame (this searches a hypercube in the 6D space 
of affine transformations by a sequence of line searches along each dimension). If the 
refined region correlates sufficiently with the original, then a new region is instantiated. 
It is here that the advantage of regions over interest points is manifest: this verification 
test takes account of local deformations due to viewpoint change, and is very reliable. 

The standard ‘book-keeping’ cases then follow: (i) no new region is instantiated (e.g. 
the region may be occluded in the frame); (ii) a new region is instantiated, in which case 
the current track is extended; (iii) if the new instantiated region matches (correlates with) 
an existing region in its (5 pixel) neighborhood then this existing region is added to the 
track; (iv) if the matched region already belongs to a track starting in the new frame, 
then the two tracks are joined. 

Figure 2 gives the ‘before and after’ histogram of track lengths, and the results of 
this repair are shown in figure 3. As can be seen, there is a dramatic improvement in the 
length of the tracks - as was the objective here. Note, the success of this method is due 
to the availability and use of two complementary constraints - motion and appearance. 
Fig. 3. Example I: (a) 6 frames from one shot (188 frames long) of the movie ‘Groundhog Day’. 
The camera is panning right, and the van moves independently. (b) Frames with the basic region 
tracks superimposed. The tracked path, i.e. the (x, y) position over time, is shown together with each 
tracked region. (c) Frames with short range repaired region tracks superimposed. Note the much 
longer tracks on the van after applying this repair. For presentation purposes, only tracks that last 
for more than 10 frames are shown. (d) One of the three dominant objects found in the shot. The 
other two are backgrounds at the beginning and end of the shot. No background is tracked in the 
middle part of the shot due to motion blur. 



4 Object Extraction by Robust Sub-space Estimation 



To achieve the final goal of identifying objects in a shot we must partition the tracks into 
groups with coherent motion. In other words, things that move together are assumed to 
belong together. For example, in the shot of figure 3 the ideal outcome would be the van 
as one object, and then several groupings of the background. The grouping constraint 
used here is that of common (semi-)rigid motion, and we assume an affine camera model 
so the structure from motion problem reduces to linear subspace estimation. 

For a 3-dimensional object, our objective would be to determine a 3D basis of 
trajectories $b^i_k$, $k = 1, 2, 3$ (to span a rank 3 subspace), so that (after subtracting the 
centroid) all the trajectories $x^i_j$ associated with the object could be written as [20]: 

$$x^i_j = (b^i_1, b^i_2, b^i_3)\,(X_j, Y_j, Z_j)^\top$$

where $x^i_j$ is the measured $(x, y)$ position of the $j$th point in frame $i$, and $(X_j, Y_j, Z_j)$ is 
the 3D affine structure. 

The maximum likelihood estimate of the basis vectors and affine structure could then 
be obtained by minimizing the reprojection error 

$$\sum_{ij} n^i_j \left\| x^i_j - (b^i_1, b^i_2, b^i_3)\,(X_j, Y_j, Z_j)^\top \right\|^2 \qquad (1)$$

where $n^i_j$ is an indicator variable to label whether the point $j$ is (correctly) detected 
in frame $i$, and must also be estimated. This indicator variable is necessary to handle 
missing data. 
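To make the roles of the basis, the structure and the indicator variable concrete, the following NumPy sketch evaluates the reprojection error of equation (1) and, for the special case of complete data, obtains a rank 3 factorization by singular value decomposition. The matrix layout (rows 2i and 2i+1 holding the x and y coordinates of frame i, with the centroid already subtracted) and the function names are assumptions for illustration only. The plain SVD does not handle missing or mismatched points.

```python
import numpy as np


def reprojection_error(X, B, S, N):
    """Evaluate equation (1).

    X: (2m, p) centred measurements, rows 2i and 2i+1 = x and y in frame i.
    B: (2m, 3) basis trajectories.  S: (3, p) affine structure.
    N: (m, p) indicator, 1 where point j is detected in frame i, else 0.
    """
    R = X - B @ S
    squared = R[0::2] ** 2 + R[1::2] ** 2  # (m, p) squared distance per frame and point
    return float(np.sum(N * squared))


def factorize_rank3(X):
    """Best rank 3 factorization X ~ B @ S when no data are missing."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    B = U[:, :3] * s[:3]
    S = Vt[:3]
    return B, S
```

Setting an entry of N to zero removes the corresponding observation from the sum, which is how a missing or rejected detection is excluded from the error.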

It is well known [17] that directly fitting a rank 3 subspace to trajectories is often 
unsuccessful and suffers from over-fitting. For example, in a video shot the inter-frame 
motion is quite slow so using motion alone it is easy to under-segment and group foreground 
objects with the background. 

We build in immunity to this problem from the start, and fit subspaces in two stages: 
first, a low dimensional model (a projective homography) is used to hypothesize groups - 
this over-segments the tracks. These groups are then associated throughout the shot using 
track co-occurrences. The outcome is that trajectories are grouped into sets belonging to 
a single object. In the second stage 3D subspaces are robustly sampled from these sets, 
without over-fitting, and used to merge the sets arising from each object. These steps are 
described in the following sub-sections. This approach differs fundamentally from that 
of [1,3] where robustness is achieved by iteratively re-weighting outliers but no account 
is taken of motion degeneracy. 

4.1 Basic Motion Grouping Using Homographies 

To determine the motion-grouped tracks for a particular frame, both the previous and 
subsequent frames are considered. The aim is then to partition all tracks extending over 
the three frames into sets with a common motion. To achieve this, homographies are fitted 
to each pair of frames of the triplet using RANSAC [6], and an inlying set is scored by its 
error averaged over the three homographies. The inlying set is removed, and RANSAC 
is then applied to the remaining tracks to extract the next largest motion grouping, etc. 
This procedure is applied to every frame in the shot. This provides temporal coherence 
(since neighboring triplets share two frames) which is useful in the next step where 
motion groups are linked throughout the shot into an object. 

4.2 Aggregating Segmentation over Multiple Frames 

The problems with fitting motion models to pairs or triplets of frames are twofold: a phantom 
motion cluster corresponding to a combination of two independent motions grouped 
Fig. 4. Aggregating segmentation over multiple frames. (a) The track co-occurrence matrix 
for a ten frame block of the shot from figure 3. White indicates high co-occurrence. (b) The 
thresholded co-occurrence matrix re-ordered according to its connected components (see text). 
(c) (d) The sets of tracks corresponding to the two largest components (of size 1157 and 97). The 
other components correspond to 16 outliers. 




Fig. 5. Trajectories following object level grouping. Left: Five region tracks (out of a total 429 
between these frames) shown as spatiotemporal “tubes” in the video volume. Right: A selection 
of 110 region tracks (of the 429) shown by their centroid motion. The frames shown are 68 and 
80. Both figures clearly show the foreshortening as the car recedes into the distance towards the 
end of the shot. The number and quality of the tracks is evident: the tubes are approaching a dense 
epipolar image [2], but with explicit correspondence; the centroid motion demonstrates that outlier 
‘strands’ have been entirely ‘combed’ out, to give a well conditioned track set. 



together can arise [15], and an outlying track will occasionally, but not consistently, be 
grouped together with the wrong motion group. In our experience these ambiguities 
tend not to be stable over many frames, but rather occasionally appear and disappear. 
To deal with these problems we devise a voting strategy which groups tracks that are 
consistently segmented together over multiple frames. 
Fig. 6. Example II: object level grouping for another (35 frame) shot from the movie ‘Groundhog 
Day’. Top row: The original frames of the shot. Middle and bottom row: The two dominant 
(measured by the number of tracks) objects detected in the shot. The number of tracks associated 
with each object is 721 (car) and 2485 (background). 



The basic motion grouping of section 4.1 provides a track segmentation for each 
frame (computed using the two neighbouring frames too). To take advantage of temporal 
consistency the shot is divided into blocks of frames over a wider baseline of n frames 
(n = 10 for example) and a track-to-track co-occurrence matrix W is computed for each 
block. The element W_ij of the matrix W accumulates a vote for each frame where tracks 
i and j are grouped together. Votes are added for all frames in the block. In other words, 
the similarity score between two tracks is the number of frames (within the 10-frame 
block) in which the two tracks were grouped together. The task is now to segment the 
track voting matrix W into temporally coherent clusters of tracks. This is achieved by 
finding connected components of a graph corresponding to the thresholded matrix W. 
To prevent under-segmentation the threshold is set to a value larger than half of the 
frame baseline of the block, i.e. 6 for the 10 frame block size. This guarantees that each 
track cannot be assigned to more than one group. Only components exceeding a certain 
minimal number of tracks are retained. Figure 4 shows an example of the voting scheme 
applied on a ten frame block from the shot of figure 3. This simple scheme segments the 
matrix W reliably and overcomes the phantoms and outliers. 
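The voting scheme can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the per-frame segmentation is assumed to be available as a dictionary from track identifier to group label, the co-occurrence matrix W is stored sparsely as a dictionary of pair counts, and the names are hypothetical. The threshold of 6 votes for a 10 frame block follows the text.

```python
def group_tracks_by_voting(frame_groupings, threshold=6, min_size=1):
    """Cluster tracks that are segmented together in enough frames.

    frame_groupings: one dict per frame of the block, track_id -> group label.
    Returns a sorted list of sorted track-id lists (connected components).
    """
    votes = {}          # sparse co-occurrence matrix W, keyed by (i, j) with i < j
    track_ids = set()
    for grouping in frame_groupings:
        track_ids.update(grouping)
        by_label = {}
        for track_id, label in grouping.items():
            by_label.setdefault(label, []).append(track_id)
        for members in by_label.values():
            members.sort()
            for pos, a in enumerate(members):
                for b in members[pos + 1:]:
                    votes[(a, b)] = votes.get((a, b), 0) + 1

    # Threshold W and build the corresponding graph.
    adjacency = {t: set() for t in track_ids}
    for (a, b), count in votes.items():
        if count >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Connected components by depth-first search.
    components, seen = [], set()
    for start in sorted(track_ids):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in adjacency[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        if len(component) >= min_size:
            components.append(sorted(component))
    return sorted(components)
```

A track that is attached to a group in only a few frames of the block never reaches the threshold with any member of that group, so it ends up as a small component that the minimum size test discards.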

The motion clusters extracted in the neighbouring 10 frame blocks are then associated 
based on the common tracks between the blocks. The result is a set of connected clusters 
of tracks which correspond to independently moving objects throughout the shot. 



4.3 Object Extraction 

The previous track clustering step usually results in no more than 10 dominant (measured 
by the number of tracks) motion clusters larger than 100 tracks. The goal now is to 
identify those clusters that belong to the same moving 3D object. This is achieved by 
grouping pairs of track-clusters over a wider baseline of m frames (m > 20 here). To test 
Fig. 7. Example III: object level grouping for another (83 frame) shot from the movie ‘Groundhog 
Day’. Top row: The original frames of the shot. Middle and bottom row: The two dominant 
(measured by the number of tracks) objects detected in the shot. The number of tracks associated 
with each object is 225 (landlady) and 2764 (background). The landlady is an example of a slowly 
deforming object. 



whether to group two clusters, tracks from both sets are pooled together and a RANSAC 
algorithm is applied to all tracks intersecting the m frames. The algorithm robustly fits 
a rank 3 subspace as described in equation (1). 

In each RANSAC iteration, four tracks are selected and full affine factorization is 
applied to estimate the three basis trajectories which span the three dimensional subspace 
of the (2m dimensional) trajectory space. All other tracks that are visible in at least five 
views are projected onto the space. A threshold is set on reprojection error (measured 
in pixels) to determine the number of inliers. To prevent the grouping of inconsistent 
clusters a high number of inliers (90%) from both sets of tracks is required. When no 
more clusters can be paired, all remaining clusters are considered as separate objects. 
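The merging test can be sketched as follows, under simplifying assumptions that go beyond the text: every pooled track is assumed to be visible in all m frames (the method above also admits tracks seen in at least five views), the centroid of the four sampled tracks is used as the origin, a track is counted as an inlier when its root mean square residual over the frames is below the pixel threshold, and the iteration count and threshold value are arbitrary. The function names are hypothetical; the four-track sample and the 90% requirement on both sets follow the text.

```python
import numpy as np


def ransac_rank3(X, iterations=200, threshold=1.0, seed=0):
    """Robustly fit a rank 3 trajectory subspace; return a boolean inlier mask.

    X: (2m, p) matrix, column j = trajectory of track j over m frames,
       rows 2i and 2i+1 = x and y in frame i (assumed layout).
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    if p < 4:
        raise ValueError("need at least four tracks")
    best = np.zeros(p, dtype=bool)
    for _ in range(iterations):
        sample = rng.choice(p, size=4, replace=False)
        centroid = X[:, sample].mean(axis=1, keepdims=True)
        # Four centred trajectories span at most a 3D subspace.
        U, _, _ = np.linalg.svd(X[:, sample] - centroid, full_matrices=False)
        B = U[:, :3]
        C = X - centroid
        R = C - B @ (B.T @ C)  # residual after projection onto the subspace
        err = np.sqrt((R[0::2] ** 2 + R[1::2] ** 2).mean(axis=0))  # pixels
        inliers = err < threshold
        if inliers.sum() > best.sum():
            best = inliers
    return best


def should_merge(X, in_first, min_fraction=0.9, **ransac_kwargs):
    """Merge two clusters if enough tracks of BOTH are inliers.

    in_first: boolean mask, True for columns belonging to the first cluster.
    """
    inliers = ransac_rank3(X, **ransac_kwargs)
    return bool(
        inliers[in_first].mean() >= min_fraction
        and inliers[~in_first].mean() >= min_fraction
    )
```

Requiring a high inlier fraction from each cluster separately, rather than from the pooled set, prevents a large cluster from absorbing a small one that moves differently.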



4.4 Object Extraction Results 

An example of one of the extracted objects (the van) is shown in figure 3d. In total, 
four objects are grouped for this shot, two corresponding to the van (before and after 
the occlusion by the post, see figure 9 in section 5) and two background objects at the 
beginning and end of the shot. The numbers of tracks associated with the objects are 
607 (van pre-occlusion), 130 (van post-occlusion), 1808 (background start) and 2466 
(background end). The sparsity pattern of the tracks belonging to different objects is 
shown in figure 2(b). Each of the background objects is composed of only one motion 
cluster. The van object is composed of two motion clusters of size 580 and 27 which are 
joined at the object extraction RANSAC stage. The quality and coverage of the resulting 
tracks is visualized in the spatio-temporal domain in figure 5. 

A second example of rigid object extraction from a different shot is given in figure 6. 
Figures 7 and 8 show examples of slowly deforming objects. This deformation is allowed 
Fig. 8. Example IV: object level grouping for another (645 frame) shot from the movie ‘Groundhog 
Day’. Top row: The original frames of the shot where a person walks across the room while tracked 
by the camera. Middle and bottom row: The two dominant (measured by the number of tracks) 
objects detected in the shot. The number of tracks associated with each object is 401 (the walking 
person) and 15,053 (background). The object corresponding to the walking person is a join of three 
objects (of size 114, 146 and 141 tracks) connected by a long range repair using wide baseline 
matching, see figure 9b. The long range repair was necessary because the tracks are broken twice: 
once due to occlusion by a plant (visible in frames two and three in the first row) and the second 
time (not shown in the figure) due to the person turning his back on the camera. The trajectory of 
the regions is not shown here in order to make the clusters visible. 



because rigidity is only applied over a sliding baseline of m frames, with m less than 
the total length of the track. For example, we are able to track regions on a slowly rotating 
and deforming face, such as one whose mouth is opening. 



5 Long Range Track Repair 



The object extraction method described in the previous section groups objects which 
are temporally coherent. The aim now is to connect objects that appear several times 
throughout a shot, for example an object that disappears for a while due to occlusion. 
Typically a set of tracks will terminate simultaneously (at the occlusion), and another 
set will start (after the occlusion). The situation is like joining up a cable (of multiple 
tracks) that has been cut. 

The set of tracks is joined by applying standard wide baseline matching [9,11,18] 
to a pair of frames that each contain the object. There are two stages: first, an 
epsilon-nearest neighbor search on a SIFT descriptor [8] for each region is performed to get a 
set of putative region matches; second, this set is disambiguated by a local spatial 
consistency constraint: a putative match is discarded if it does not have a supporting match 
within its k-nearest spatial neighbors [12,14]. Since each region considered for matching 
is part of a track, it is straightforward to extend the matching to tracks. The two objects 
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(a) (b) 



Fig. 9. Two examples of long range repair on (a) shot from figure 3 where a van is occluded (by 
a post) which causes the tracking and motion segmentation to fail, and (b) shot from figure 8 
where a person walks behind a plant. First row: Sample frames from the two sequences. Second 
row: Wide-baseline matches on regions of the two frames. The green lines show links between the 
matched regions. Third row: Region tracks on the two objects that have been matched in the shot. 



are deemed matched if the number of matched tracks exceeds a threshold. Figure 9 gives 
two examples of long range repair on shots where the object was temporarily occluded. 
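
The two matching stages can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the paper: descriptors are plain tuples compared by Euclidean distance, each region is located by a single centre point, and the names `putative_matches`, `k_nearest`, `spatially_consistent`, `eps` and `k` are hypothetical.

```python
import math

def putative_matches(desc_a, desc_b, eps):
    """Stage 1: pair region i of frame A with region j of frame B whenever
    their descriptors lie within Euclidean distance eps of each other."""
    matches = []
    for i, da in enumerate(desc_a):
        for j, db in enumerate(desc_b):
            if math.dist(da, db) <= eps:
                matches.append((i, j))
    return matches

def k_nearest(points, idx, k):
    """Indices of the k points spatially closest to points[idx] (itself excluded)."""
    others = [n for n in range(len(points)) if n != idx]
    others.sort(key=lambda n: math.dist(points[idx], points[n]))
    return set(others[:k])

def spatially_consistent(matches, pos_a, pos_b, k):
    """Stage 2: keep match (i, j) only if another match (i2, j2) supports it,
    i.e. i2 is among the k nearest neighbours of i and j2 among those of j."""
    kept = []
    for (i, j) in matches:
        near_i = k_nearest(pos_a, i, k)
        near_j = k_nearest(pos_b, j, k)
        if any(i2 in near_i and j2 in near_j
               for (i2, j2) in matches if (i2, j2) != (i, j)):
            kept.append((i, j))
    return kept

# Toy data (assumed): region 2 of A matches region 3 of B by descriptor only.
desc_a = [(0, 0), (1, 1), (2, 2), (9, 9)]
desc_b = [(0, 0.05), (1, 1.05), (7, 7), (2, 2.05)]
pos_a = [(0, 0), (1, 0), (10, 0), (11, 0)]
pos_b = [(0, 0), (1, 0), (10, 0), (11, 0)]
candidates = putative_matches(desc_a, desc_b, eps=0.1)
surviving = spatially_consistent(candidates, pos_a, pos_b, k=1)
```

In the toy data the third candidate has no supporting match among its spatial neighbours and is dropped. Counting the surviving matches per pair of objects and comparing the count with a threshold would then give the object-level decision described above.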

6 Application: Object Level Video Matching 

Having computed object level groupings for shots throughout the film, we are now in a 
position to retrieve object matches given only part of the object as a query region. Grouped 
objects are represented by the union of the regions associated with all of the object’s 
tracks. This provides an implicit representation of the 3D structure, and is sufficient for 
matching when different parts of the object are seen in different frames. In more detail, 
an object is represented by the set of regions associated with it in each key-frame. As 
shown in figures 10 and 11, the set of key-frames naturally spans the object’s visual 
aspects contained within the shot. 

In the application we have engineered, the user outlines a query region of a key-frame 
in order to obtain other key-frames or shots containing the scene or object delineated by 
the region. The objective is to retrieve all key-frames/shots within the film containing 
the object, even though it may be imaged from a different visual aspect. 

The object-level matching is carried out by determining the set of affine invariant 
regions enclosed by the query region. The convex hull of these tracked regions is then 
computed in each key frame, and this hull determines in turn a query region for that 
frame. Matching is then carried out for all query regions using the Video Google method 
described in [14]. 
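
The hull step can be illustrated as follows. This is a minimal sketch assuming each tracked region is reduced to its centre point in the key-frame; the function name `convex_hull` and the sample coordinates are hypothetical, and the hull is computed with Andrew's monotone chain, a standard algorithm that the paper does not itself specify.

```python
def convex_hull(points):
    """Convex hull by Andrew's monotone chain.
    Returns the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]

# Assumed centres of the tracked regions in one key-frame.
region_centres = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
query_region = convex_hull(region_centres)
```

The resulting polygon plays the role of the automatically derived query region for that key-frame; interior regions such as (1, 1) do not affect it.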

An example of object-level matching throughout a database of 5,641 key-frames of 
the entire movie ‘Groundhog Day’ is shown in figures 10 and 11. 
Fig. 10. Object level video matching I. Top row: The query frame with query region (side of the
van) selected by the user. Second row: The automatically associated keyframes and outlined query 
regions. Next four rows: Example frames retrieved from the entire movie ‘Groundhog Day’ by 
the object level query. Note that views of the van from the back and front are retrieved. This is not 
possible with wide-baseline matching methods alone using only the side of the van visible in the 
query image. 



7 Discussion and Extensions 

We have shown that representing an object as a set of viewpoint invariant patches has 
a number of beneficial consequences: gaps in tracks can be reliably repaired; tracked 
objects can be matched across occlusions; and, most importantly here, different view- 
points of the object can be associated provided they are sampled by the motion within a 
shot. 
Fig. 11. Object level video matching II. Top row: The query frame with query region selected
by the user. The query frame acts as a portal to the keyframes associated with the object by the
motion-based grouping (shown in the second row). Note that in the associated keyframes the
person is visible from the front and also changes scale. See figure 8 for the corresponding object
segmentation. Next three rows: Example frames retrieved from the entire movie ‘Groundhog Day’
by the object level query.



We are now at a point where useful object level groupings can be computed auto- 
matically for shots that contain a few objects moving independently and semi-rigidly. 
This has opened up the possibility of pre-computing object-level matches throughout a 
film - so that content-based retrieval for images can access objects directly, rather than 
image regions; and queries can be posed at the object, rather than image, level. 
Abstract. In this paper, we aim to recover the 3D shape of a human
face using a single image. We use a combination of the symmetric shape from
shading of Zhao and Chellappa and the statistical approach to facial shape
reconstruction of Atick, Griffin and Redlich. Given a single frontal image
of a human face under a known directional illumination from a side, 
we represent the solution as a linear combination of basis shapes and 
recover the coefficients using a symmetry constraint on a facial shape 
and albedo. By solving a single least-squares system of equations, our 
algorithm provides a closed-form solution which satisfies both symmetry 
and statistical constraints in the best possible way. Our procedure takes 
only a few seconds, accounts for varying facial albedo, and is simpler 
than the previous methods. In the special case of horizontal illuminant 
direction, our algorithm runs even as fast as matrix-vector multiplication. 



1 Introduction 

The problem of estimating the shape of an object from its shading was first
introduced by Horn [1]. He defined the mapping between the shading and surface
shape in terms of the reflectance function $I_{x,y} = R(p, q)$, where $I_{x,y}$ denotes
image intensity, $p = z_x$ and $q = z_y$, $z$ being the depth of the object and $(x, y)$
are projected spatial coordinates of the 3D object. In this paper we will assume
orthographic projection and Lambertian reflectance model, thus obtaining the
following brightness constraint:

$$I_{x,y} \propto \rho_{x,y} \frac{1 - pl - qk}{\sqrt{p^2 + q^2 + 1}\,\sqrt{l^2 + k^2 + 1}} \tag{1}$$

where $(l, k, 1)$ is the illuminant direction (we have here proportion, instead of
equality, because of the light source intensity). The task of a shape from shading
algorithm is to estimate the unknowns of Eq. (1), which are the surface albedos
$\rho_{x,y}$ and the surface depths $z_{x,y}$. With only image intensities known, estimating
both the depths and the albedos is ill-posed. A common practice is to assume a
constant surface albedo, but in a survey [2] it is concluded that depth estimates
for real images come out to be very poor with this simplistic assumption.



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3022, pp. 99-113, 2004. 
In this paper, we aim to recover the 3D shape of a human face using a single
image. In this case there are some other constraints that can be imposed on
the unknown variables $\rho_{x,y}$ and $z_{x,y}$ in Eq. (1). Shimshoni et al. [3] and Zhao
and Chellappa [4,5] presented shape from shading approaches for symmetric
objects, which are applicable to human faces. In [3], geometric and photometric 
stereo are combined to obtain a 3D reconstruction of quasi-frontal face images. 
In [4,5] symmetry constraints on depth and albedo are used to obtain another, 
albedo-free brightness constraint. Using this constraint and Eq. (1) they find a
depth value at every point. Lee and Rosenfeld [6] presented an approximate 
albedo estimation method for scene segmentation. Using this albedo estimation 
method, Tsai and Shah [7] presented a shape from shading algorithm, which 
starts with segmenting a scene into piece-wise constant albedo regions, and
then applies some shape from shading algorithm on each region independently. 
Nandy and Ben-Arie [8,9] assume constant facial albedo and recover facial depth 
using neural networks. All the methods mentioned above assume either constant 
or piece-wise constant albedo, which is not a realistic assumption for real 3D 
objects such as human faces. Finally, Ononye and Smith [10] used color images 
for 3D recovery and obtained good results for simple objects. Results which 
they showed on unrestricted facial images are not so accurate, however, probably 
because of the lack of independence of the R, G, B channels in these images.

Atick et al. [11] and Vetter et al. [12,13,14,15] provided statistical shape 
from shading algorithms, which attempt to reconstruct the 3D shape of a 
human face using statistical knowledge about the shapes and albedos of 
human faces in general. In [11], a constant facial albedo is assumed and 
a linear constraint on the shape is imposed. The authors of [12,13,14,15]
went a step further and dropped the constant albedo assumption, imposing
a linear constraint on both texture (albedo map) and shape of the
face. Because facial texture is not as smooth in general as the facial shape, 
imposing linear constraint on the texture requires a special preprocessing 
stage to align the database facial images to better match each other. Both ap- 
proaches use certain optimization methods to find the coefficients of the linear 
combinations present in the linear constraints they are using, thus providing no 
closed-form solution, and consuming significant computational time. 

We present an algorithm which accounts for varying facial albedo. Our me- 
thod provides a closed-form solution to the problem by solving a single least- 
squares system of equations, obtained by combining albedo-free brightness and 
class linearity constraints. Our approach requires a restrictive setup: frontal face 
view, known directional illumination (that can be estimated for example by 
Pentland’s method [16]) and Lambertian assumption about the face. We also 
get some inaccuracies in the reconstructed faces because human faces are not 
perfectly symmetric [17]. However, this is the first algorithm for 3D face
reconstruction from a single image which provides a closed-form solution within a
few seconds.

The organization of the paper is as follows. In Sect. 2, we describe in detail 
previous work that is relevant to our approach, mainly symmetric and statistical 
shape from shading algorithms. Later, we present our algorithm in Sect. 3, and
give experimental results in Sect. 4. Finally, we draw conclusions in Sect. 5. 



2 Previous Work 



2.1 Symmetric Shape from Shading 

Zhao and Chellappa [4,5] introduced a symmetric shape from shading algorithm
which uses the following symmetry constraint:

$$\rho_{x,y} = \rho_{-x,y}, \qquad z_{x,y} = z_{-x,y} \tag{2}$$

to recover the shape of bilaterally symmetric objects, e.g. human faces. From
brightness and symmetry constraints together the following equations follow:

$$I_{-x,y} \propto \rho_{x,y} \frac{1 + pl - qk}{\sqrt{p^2 + q^2 + 1}\,\sqrt{l^2 + k^2 + 1}} \tag{3}$$

$$\frac{I_{-x,y} - I_{x,y}}{I_{-x,y} + I_{x,y}} = \frac{pl}{1 - qk} \tag{4}$$

Denoting $D_{x,y} = I_{-x,y} - I_{x,y}$ and $S_{x,y} = I_{-x,y} + I_{x,y}$ we obtain the following
albedo free brightness constraint:

$$Slp + Dkq = D. \tag{5}$$

Using Eq. (5), Zhao and Chellappa write $p$ as a function of $q$, and substitute
it into Eq. (1), obtaining an equation in two unknowns, $q$ and the albedo. They
approximate the albedo by a piece-wise constant function and solve the equation
for $q$. After recovering $q$ and then $p$, the surface depth $z$ can be recovered by any
surface integration method, e.g., the one used in [18].
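
As a sanity check of the albedo free constraint in Eq. (5), the short sketch below evaluates the Lambertian model at a pixel and at its mirror image and confirms that the albedo drops out. It assumes the symmetry constraint holds exactly, so that mirroring flips the sign of p while q and the albedo are unchanged; the function names and the numeric values are illustrative only.

```python
import math

def lambertian(albedo, p, q, l, k):
    """Right-hand side of the brightness constraint, Eq. (1),
    up to the unknown light source intensity."""
    return albedo * (1 - p * l - q * k) / (
        math.sqrt(p * p + q * q + 1) * math.sqrt(l * l + k * k + 1))

def albedo_free_residual(albedo, p, q, l, k):
    """Return S*l*p + D*k*q - D for a mirror pair of pixels; zero if Eq. (5) holds."""
    i_right = lambertian(albedo, p, q, l, k)   # I_{x,y}
    i_left = lambertian(albedo, -p, q, l, k)   # I_{-x,y}: p changes sign under mirroring
    d = i_left - i_right                       # D_{x,y}
    s = i_left + i_right                       # S_{x,y}
    return s * l * p + d * k * q - d

# Two different albedos at the same surface orientation and lighting (assumed values).
residual_dark = albedo_free_residual(0.2, 0.3, -0.2, -0.36, 0.19)
residual_bright = albedo_free_residual(0.9, 0.3, -0.2, -0.36, 0.19)
```

Both residuals vanish up to rounding error, whatever albedo is chosen, which is the property that lets Eq. (5) be used without knowing the albedo.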

Yilmaz and Shah [19] tried to abandon the albedo piece-wise constancy assumption
by solving Eq. (5) directly. They wrote Eq. (5) as an equation in $z_{x,y}$,
instead of $p$ and $q$, and tried to solve it iteratively. A linear partial differential
equation

$$a_{x,y} z_x + b_{x,y} z_y = c_{x,y}, \tag{6}$$

of the same type in $z$, in different context, appears in the linear shape from
shading method of Pentland [20], and was used also in [21].

Pentland tried to solve Eq. (6) by taking the Fourier transform of both sides,
obtaining

$$A_{u,v}(-iu) Z_{u,v} + B_{u,v}(-iv) Z_{u,v} = C_{u,v}, \tag{7}$$

where $A$, $B$, $C$ and $Z$ stand for the Fourier transforms of $a$, $b$, $c$ and $z$, respectively.
It was stated in [20] that $Z_{u,v}$ can be computed by rearranging terms in Eq. (7)
and taking the inverse Fourier transform. However, rearranging terms in Eq. (7)
results in $Z_{u,v} = iC_{u,v} / (A_{u,v} u + B_{u,v} v)$. This equation is undefined when
$A_{u,v} u + B_{u,v} v$ vanishes, and thus it leaves ambiguities in $Z_{u,v}$, and therefore also in $z_{x,y}$.
As was noted in [22], but was noticed neither in [19] nor in [20,21], Eq. (6),
being a linear partial differential equation in z, can be solved only up to a one 
dimensional ambiguity (hidden in the initial conditions), for example via the 
characteristic strip method [23]. Hence, Eq. (5) cannot be solved alone, without 
leaving large ambiguities in the solution. 



2.2 Statistical Shape from Shading 

Atick et al. [11] took a collection of 200 heads where each head was scanned
using a CyberWare Laser scanner. Each surface was represented in cylindrical
coordinates as $r^t(\theta, h)$, $1 \le t \le 200$, with 512 units of resolution for $\theta$ and 256
units of resolution for $h$. After cropping the data to a 256 x 200 (angular x
height) grid around the nose, it was used to build eigenhead decomposition as
explained below. They took $r_0(\theta, h)$ to be the average of 200 heads and then
performed principal component analysis [24] obtaining the eigenfunctions $\Psi_i$
such that any other human head could be represented by a linear combination
$r(\theta, h) = r_0(\theta, h) + \sum_i \beta_i \Psi_i(\theta, h)$. After that they applied conjugate gradient
optimization method to find the coefficients $\beta_i$, such that the resulting $r(\theta, h)$
will satisfy the brightness constraint in Eq. (1), assuming constant albedo.

In [12,13,14,15], the authors went a step further and dropped the constant
albedo assumption, imposing a linear constraint on both texture (albedo
map) and shape of the face. In order to achieve shape and texture coordinate
alignment between the basis faces, they parameterized each basis face by a fixed
collection of 3D vertices $(X_j, Y_j, Z_j)$, called a point distribution model, with
associated color $(R_j, B_j, G_j)$ for each vertex. Enumeration of the vertices was the
same for each basis face. They modelled the texture of a human face by eigenfaces
[25] and its shape by eigenheads, as described above. They recovered sets of
texture and shape coefficients via complex multi-scale optimization techniques.
These papers also treated non-Lambertian reflectance, which we do not treat
here.

Both statistical shape from shading approaches described above do not use
the simple Cartesian $(x, y) \mapsto z(x, y)$ parameterization for the eigenhead surfaces.
While the parameterizations described above are more appropriate for
capturing linear relations among the basis heads, they have the drawback of
projecting the same head vertex onto different image locations, depending on
the shape coefficients. Thus, image intensity depends on the shape coefficients.
To make this dependence linear, the authors in [11] used Taylor expansion of
the image $I$, with approximation error consequences. In [12,13,14,15], this dependence
is taken into account in each iteration, thus slowing down the convergence
speed.

3 Statistical Symmetric Shape from Shading 

We use in this paper a database of 138 human heads from the USF Human-ID
3D Face Database [26] scanned with a CyberWare Laser scanner. Every head,
originally represented in the database in cylindrical coordinates, is resampled
to a Cartesian grid of resolution 142 x 125. Then, we have used a threshold on
the standard deviation $z^{std}$ of the facial depths, in order to mask out pixels
$(x, y)$ which are not present in all the basis faces, or alternatively are unreliable
(see Fig. 1 (a) for the masked standard deviation map). After that we perform
PCA on the first 130 heads, obtaining the 130 eigenheads $z^i$ (first five of which
are shown in Fig. 1 (b)-(f)), and keeping the remaining eight heads for testing
purposes. We then constrain the shape of a face to be reconstructed, $z_{x,y}$, to the
form:

$$z_{x,y} = \sum_{i=1}^{130} \alpha_i z^i_{x,y} \tag{8}$$

for some choice of coefficients $\{\alpha_i\}$.

Fig. 1. (a) Standard deviation surface $z^{std}$. Note high values of $z^{std}$ around the nose
for example: this is a problem of the Cartesian representation - high $z$ variance in areas
with high spatial variance, like the area around the nose. (b)-(f) The five most significant
eigenheads $z^1, \ldots, z^5$, out of the 130 eigenheads.

Since our face space constraint (8) is written in a Cartesian form, we can take
derivatives w.r.t. $x$ and $y$ of both sides to obtain face space constraints on $p$, $q$:

$$p = \sum_{i=1}^{130} \alpha_i p^i, \qquad q = \sum_{i=1}^{130} \alpha_i q^i, \tag{9}$$

where $p^i = z^i_x$ and $q^i = z^i_y$. The two equations above, together with the albedo
free brightness constraint (5), result in the following equation chain:

$$D = Slp + Dkq = Sl \sum_{i=1}^{130} \alpha_i p^i + Dk \sum_{i=1}^{130} \alpha_i q^i = \sum_{i=1}^{130} (Sl p^i + Dk q^i)\, \alpha_i. \tag{10}$$

This equation is linear in the only unknowns $\alpha_i$, $1 \le i \le 130$. We find a
least-squares solution, and then recover $z$ using Eq. (8). One can speed up calculations
by using fewer than 130 eigenheads in the face space constraint (8).
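
The least-squares step for Eq. (10) can be sketched as below. This is an illustrative toy version, not the authors' MATLAB code: it assumes the pixels are flattened into lists, forms the normal equations explicitly, and solves them by Gauss-Jordan elimination. The names `solve_linear` and `recover_coefficients` are hypothetical.

```python
def solve_linear(a, b):
    """Solve the small square system a x = b by Gauss-Jordan elimination
    with partial pivoting (a is assumed to be non-singular)."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]   # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                for j in range(c, n + 1):
                    m[r][j] -= f * m[c][j]
    return [m[r][n] / m[r][r] for r in range(n)]

def recover_coefficients(S, D, P, Q, l, k):
    """Least-squares alpha for  sum_i (S*l*p^i + D*k*q^i) * alpha_i = D  (Eq. 10).

    S, D : per-pixel lists of the sum and difference images.
    P, Q : P[i] and Q[i] are per-pixel lists of the x and y derivatives
           of eigenhead i.
    """
    npix = len(S)
    # One column per eigenhead, one row per pixel.
    cols = [[S[t] * l * P[i][t] + D[t] * k * Q[i][t] for t in range(npix)]
            for i in range(len(P))]
    # Normal equations (A^T A) alpha = A^T D.
    ata = [[sum(ci[t] * cj[t] for t in range(npix)) for cj in cols] for ci in cols]
    atd = [sum(ci[t] * D[t] for t in range(npix)) for ci in cols]
    return solve_linear(ata, atd)
```

With the coefficients in hand, the depth follows from the linear combination in Eq. (8). For a basis as large as 130 eigenheads a numerically more careful solver (for example one based on a QR or SVD factorization) would normally be preferred over explicit normal equations.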





Fig. 2. Testing of generalization ability of our eigenhead decomposition. (a) One of the
eight out-of-sample heads not used to construct the face space. (b) Projection of surface
in (a) onto the 130-dimensional face space. (c) Error surface. Note some correlation with
standard deviation surface from Fig. 1. (d) Plot of dependence of the reconstruction
error on the number of modes used in the representation. The error is defined as
$|z^{actual} - z^{estimated}| \cdot \cos \sigma^{slant}(z^{actual})$, averaged over all points on the surface and over
eight out-of-sample heads, and is displayed as a percentage of the whole dynamic
range of $z$. Error is normalized by the cosine of the slant angle, to account for the fact
that the actual distance between two surfaces $z^{actual}$ and $z^{estimated}$ is approximately
the distance along the $z$-axis times the cosine of the slant angle of the ground truth
surface.



The choice of Cartesian coordinates was necessary for obtaining face space
constraints on $p$ and $q$ in Eq. (9). Although Cartesian parameterization is less
appropriate for eigenhead decomposition than cylindrical or point distribution
model parameterizations, it provides eigenhead decomposition with sufficient
generalization ability as is shown in Fig. 2 - in reconstruction example (a)-(c) we
see relatively unnoticeable error, and in (d) we see a rather fast decay of the
generalization error when the number of basis heads increases. Dividing the
generalization error with the first eigenhead only (which is just the average head)
by the generalization error with 130 eigenheads we obtain the generalization quality
$\|z^{actual} - z^m\| / \|z^{actual} - z^{estimated}\| = 5.97$ (here $z^m$ stands for the average of the first 130 heads,
and $z^{actual}$ and $z^{estimated}$ stand for the true and estimated depths, respectively),
while [11] achieves with cylindrical coordinates a generalization quality of about
10. We provide in Sect. 3.3 a recipe for improving the generalization ability of the
model, albeit without giving empirical evidence for it. The generalization errors
we have right now are insignificant compared to errors in the reconstructions
themselves, so that they do not play a major role in the accuracy of the results.
3.1 Special Case of k = 0

In the case of k = 0, Eq. (10) has a simpler form,

$$\frac{D}{Sl} = \sum_{i=1}^{130} \alpha_i p^i. \tag{11}$$

Setting $\{p^i\}$ to be the principal components of the $x$ derivatives of the facial
depths, and $z^i$ accordingly to be the linear combinations of the original depths
with the same coefficients which are used in $p^i$, one could solve for the $\alpha$’s by setting
$\alpha_i = \langle D/(Sl),\, p^i \rangle$ (this is $\approx$ 150 times faster than in our general method; in fact, this
is as fast as a matrix-vector multiplication), and for $z$ via Eq. (8). This could be
done up to a scaling factor, even without knowledge of light source direction.

Because different heads, which were used to build the face space, have face
parts with different $y$ coordinates, the face space has some vertical ambiguities.
This means that for a given face in the face space, a face with $y$ shifts
of its face components is also in the face space. In contrast to the general case,
where ambiguity is driven by the PDE’s characteristic curves, and therefore is fairly
random, in the case of k = 0 the ambiguity is in the vertical direction and thus
resonates with the ambiguity in the face space. So, in the case of k = 0, the final solution
also will contain vertical ambiguities of the type $z_{x,y} = z^0_{x,y} + A(y)$ (where $A(y)$ is
the ambiguity itself). We will show empirically that, due to vertical ambiguities
in the face space, results turn out to be quite inaccurate in this special case.

3.2 Extending Solution’s Spatial Support 

All basis heads used in build-up of the face space have different x, y support, and
spatial support of the eigenheads is basically limited to the intersection of their
supports. Therefore spatial support of the reconstructed face area is also limited.
To overcome this shortcoming of our algorithm, one can fit a surface parameterized
via a point distribution model (PDM) [12,13,14,15] to our solution surface
$z = z_{x,y}$. This fitting uses our partial solution surface for 3D reconstruction of the
whole face, and is straightforward, as opposed to fitting of PDM to a 2D image.



3.3 Improving Generalization Ability 

In the Cartesian version of the face space (8) eigenheads do not match each other
perfectly. For example noses of different people have different sizes in both x, y
and z directions. A linear combination of two noses with different x, y support
produces something which is not a nose. Suppose now that we have a certain
face with certain x, y support for its nose. In order to get this nose we need basis
faces with noses of similar x, y support to this nose. This means that only a few
basis faces will be used in a linear combination to produce this particular nose.
This observation explains why a Cartesian version of face space has the highest
generalization error among all face space representations: other representations
have a better ability to match the supports of different face parts such as noses.

In order to overcome the drawbacks of the Cartesian principal component
decomposition mentioned above, we suggest another Cartesian decomposition for
the shape space of human faces. Using NMF [27] or ICA [28,29], it is possible to
decompose the shape of a human face as a sum of its basic parts $z^i$, $1 \le i \le 130$.
In contrast to the principal component analysis, each $z^i$ will have a compact x, y
support. Therefore, from each $z^i$ several $z^i_j$ can be derived, which are just slightly
shifted and/or scaled copies of the original $z^i$, in the x, y plane, in random
directions. Using these shifted (and/or scaled) copies of the original $z^i$, $1 \le i \le 130$, a
broader class of face shapes can be obtained by a linear combination.

An alternative method for improving the generalization quality of the face 
space is to perform a spatial alignment of DB faces via image warping using 
for example manually selected landmarks [30]. For a progressive alignment it is 
possible to use eigenfeatures, which can be computed from the aligned DB faces. 



4 Experiments 

We have tested our algorithm on the Yale face database B [31]. This database 
contains frontal images of ten people from varying illumination angles. Align- 
ment in the x, y plane between the faces in the Yale database and the 3D models 
is achieved using spatial coordinates of the centers of the eyes. We use eye coor- 
dinates of people from the Yale database, which are available. Also we marked 
the eye centers for our z m and used it for the alignment with the images. 

For every person in the Yale face database B, there are 63 frontal facial views 
with different known point illuminations stored. Also for each person its ambient 
image is stored. As in this paper we do not deal with the ambient component we 
subtract the ambient image from each one of the 63 images, prior to using them. 
As we need ground truth depths for these faces for testing purposes, and it is 
not given via a laser scanner for example, we first tried to apply photometric 
stereo on these images, in order to compute the depths. However, because the 63 
light sources have different unknown intensities, performing photometric stereo 
is impossible. Hence, we have taken a different strategy for “ground truth” depth 
computation. 

We took two images, taken under different illumination conditions, for each
one of the ten faces present in the database: a frontal image $F$ with zero azimuth
$Az$ and elevation $El$, and an image $I$ with azimuth angle $Az = 20^\circ$ and
elevation angle $El = 10^\circ$. Substituting $l = -\tan 20^\circ$ and $k = \tan 10^\circ / \cos 20^\circ$
into Eq. (1), we obtain a Lambertian equation for the image $I$. Doing the same
with $l = 0$ and $k = 0$ we obtain a Lambertian equation for the image $F$. Dividing
these two equations, we obtain the following equation, with $l = -\tan 20^\circ$ and
$k = \tan 10^\circ / \cos 20^\circ$ (we have here equality up to a scaling factor $\lambda$ caused by a
difference of light source intensities used to produce the images $I$ and $F$):

$$\lambda \frac{I_{x,y}}{F_{x,y}} = 1 - pl - qk. \tag{12}$$
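
For readers who want to reproduce the numbers, the substitution used above can be written as a small helper. The text only gives the values for Az = 20° and El = 10°; the general formulas l = -tan(Az) and k = tan(El)/cos(Az) in the sketch are an extrapolation from that single instance, and the function names are hypothetical.

```python
import math

def light_direction(az_deg, el_deg):
    """Return (l, k) for a light source at the given azimuth and elevation,
    assuming l = -tan(Az) and k = tan(El) / cos(Az) as in the example above."""
    az = math.radians(az_deg)
    el = math.radians(el_deg)
    return -math.tan(az), math.tan(el) / math.cos(az)

def ratio_rhs(p, q, l, k):
    """Right-hand side of Eq. (12): the side-lit to frontal image ratio,
    up to the scaling factor lambda."""
    return 1 - p * l - q * k

l, k = light_direction(20, 10)        # the side-lit image I
l0, k0 = light_direction(0, 0)        # the frontal image F
```

For the frontal image both parameters vanish, so the Lambertian numerator reduces to 1 and dividing the two brightness equations leaves only the factor 1 - pl - qk, with the albedo and the surface normal length cancelled.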
Table 1. Quality and computational complexity of our algorithm, applied to all the ten
Yale faces. Ten columns correspond to the ten faces. The first row contains asymmetry
estimates of the faces (we subtract the lowest element). The second row contains the
quality estimates, measured via an inverse normalized distance between the estimated
and actual depths. The third row contains the quality estimates with statistical Cartesian
shape from shading (assuming constant albedo as in the work of Atick et al. [11]).
The last row contains running times (in seconds) of our algorithm on all ten faces.

Face            1     2     3     4     5     6     7     8     9     10
Asymmetry       0.53  1.08  0     0.52  1.36  0.87  0.75  0.68  0.65  0.38
Quality         1.81  1.78  3.39  1.92  1.91  1.29  1.96  2.59  1.45  2.94
Const Albedo    1.27  1.24  1.76  1.41  1.57  1.10  2.15  2.5   1.5   2.35
Running Time    1.76  1.76  1.76  1.76  2.21  1.76  1.75  1.76  1.75  1.76


Eqs. (12) and (9) together enable us to get the “ground truth” depth, given the
scaling factor $\lambda$ (which we will show how to calculate, later on):

$$F_{x,y} - \lambda I_{x,y} = \sum_{i=1}^{130} (F_{x,y} l p^i + F_{x,y} k q^i)\, \alpha_i. \tag{13}$$

Of course such a “ground truth” is less accurate than one obtained with
photometric stereo, because it is based on two, rather than on all 63 images,
and because it uses the face space decomposition, which introduces its own
generalization error. Also small differences between the images I and F cause
small errors in the resulting “ground truth”. Still this “ground truth” has a major
advantage over results obtained by our symmetric shape from shading algorithm,
which is that it does not use an inaccurate symmetry assumption about the
faces [17].

Our algorithm estimates the depths of each one of the ten faces by solving
Eq. (10). Then we take the estimated $\alpha_i$’s and plug them into Eq. (13). We find
the best $\lambda$ satisfying this equation and use it in the “ground truth” calculation.
This mini-algorithm for $\lambda$ calculation is based on the fact that our estimation
and the ground truth are supposed to be close, and therefore our estimation can
be used to reduce ambiguity in the “ground truth” solution. As $\lambda$ has some error,
we need to perform a small additional alignment between the “ground truth”
and the estimated depths. We scale the “ground truth” solution at the end, so
that it will have the same mean as the estimated depth.

In Table 1 we provide asymmetry estimates for all ten Yale DB faces along
with quality and computational complexity estimates of the results. Asymmetry
is measured via a Frobenius distance between normalized (to mean gray level 1)
frontally illuminated face F and its reflection R. Quality is measured by the fraction
$\|z^{actual}\| / \|z^{actual} - z^{estimated}\|$, and computational complexity is measured by
a running time, in MATLAB, on a Pentium 4 1600MHz computer. The correlation
coefficient between the facial asymmetry and resulting quality estimates is -0.65,
indicating a relatively strong anti-correlation between quality and asymmetry.
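
The reported correlation can be recomputed from the first two rows of Table 1. The sketch below uses the ordinary Pearson coefficient and, for the quality measure, assumes a Euclidean norm over flattened depth values; the paper does not state which norm is used, and the function names are hypothetical.

```python
import math

# First two rows of Table 1 (faces 1 to 10).
asymmetry = [0.53, 1.08, 0, 0.52, 1.36, 0.87, 0.75, 0.68, 0.65, 0.38]
quality = [1.81, 1.78, 3.39, 1.92, 1.91, 1.29, 1.96, 2.59, 1.45, 2.94]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def quality_measure(actual, estimated):
    """Norm of the actual depths divided by the norm of the depth error
    (Euclidean norm assumed); larger values mean a better reconstruction."""
    norm_actual = math.sqrt(sum(a * a for a in actual))
    norm_error = math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimated)))
    return norm_actual / norm_error

r = pearson(asymmetry, quality)   # about -0.65
```

The value obtained from the tabulated (rounded) entries agrees with the coefficient quoted in the text to two decimal places.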
Table 2. Quality of results of our algorithm for the case that k = 0, on all the ten Yale
faces. The ten columns correspond to the ten faces. We performed quality estimation
for reconstructions from images with Az = -10° and Az = -25°. Prior to estimating
the quality, we shifted and stretched the estimated depths, so that they will have the
same mean and variance as the “ground truth” depths. We estimated quality with and
without row normalization, in which each depth row to be estimated was normalized
to have the same mean as its counterpart row in the “ground truth” depth.

Face            1     2     3     4     5     6     7     8     9     10

Without normalization (true quality)
Az = -10°       0.70  0.96  0.88  1.25  0.76  0.90  1.06  1.63  0.81  1.20
Az = -25°       0.86  1.24  1.14  1.21  0.91  0.68  1.29  1.81  0.92  1.50

With normalization (additional test)
Az = -10°       1.34  1.63  1.58  1.69  1.52  1.11  1.46  2.92  1.54  2.07
Az = -25°       1.70  2.20  2.27  1.93  2.12  0.90  2.08  3.14  1.88  2.27


Results with Cartesian face space, replacing the symmetry with the constant
albedo assumption (thus simulating the work of Atick et al. [11]), are found via
iterative optimization on the $\alpha_i$’s according to Eq. (1), initialized by our solution.
For each database image we chose its best scale to fit into the Lambertian equation,
thus making the statistical shape from shading results as good as possible
for a faithful comparison. Our results are slightly better on most faces (Table 1).

The three best results of our algorithm, with quality at least 2.5 (faces 3, 8
and 10 in the Yale ordering), are depicted in Fig. 3 (along with their statistical
SFS counterparts). Also, in the first three rows of Fig. 4, we show textured faces
(with texture being the image of frontally illuminated face), rendered with our
“ground truth”, estimated and average depths. In the first three rows of Fig. 5,
we render these faces as if they were shot using frontal illumination, by taking
images with Az = 20° and El = 10° and cancelling out the side illumination effect
by dividing them by $1 - lp - kq$, where $p$ and $q$ are recovered by our algorithm
from the image $I$ and are given directly (without using $z$) by Eq. (9).

In the last row of Fig. 4, we show results for the face number 6 in the Yale 
database, which has the worst reconstruction quality (see Table 1). In the last row 
of Fig. 5, we show renderings for this face. One can note a significant asymmetry 
of the face, which explains the rather bad reconstruction results in Fig. 4. We 
provide, in the additional material, results of the algorithm on all ten Yale faces. 

We attribute inaccurate results of our algorithm, on many faces, to facial
asymmetry. Results in Fig. 5 can be compared with similar results by Zhao and
Chellappa [4,5] (see Figs. 14 and 15 in [4]). Note that both are affected by facial
asymmetry. Using some illumination invariant feature matcher [32], features on the
two sides of a face could be matched based solely on the albedo, and the face could then
be warped to one with symmetric shape and texture, but a warped illumination (with
less impact on errors, due to smoothness of the illumination). However, we doubt
whether this approach is feasible, because of the matching errors.
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Fig. 3. Three columns correspond to three different faces from the Yale face database B. 
First row contains meshes of the faces with “ground truth” depths. Second row contains 
the meshes reconstructed by our algorithm from images with lighting Az = 20° and 
El = 10°. Third row contains reconstructions from statistical Cartesian SFS algorithm. 



4.1 Results in the Case of k = 0

In the special case of zero light source elevation angle El, we took two images
with different light source azimuth angles Az for each one of the ten faces present
in the database: one image with Az = -10° and the other with Az = -25°. Our
algorithm estimates the depths of each one of the ten faces by solving Eq. (11). We
shift and stretch the estimated depths so that they have the same mean
and variance as the “ground truth” depths. In the first part of Table 2, we provide
quality estimates of the results of our special case algorithm on all ten faces.

We have done further alignment between the estimated and “ground truth”
depths. We normalized every row of the estimated depths to have the
same mean as its counterpart row in the “ground truth” depths. We then
measured the quality estimates of the results and present them in
the second part of Table 2. One can note a significant increase in the estimates
relative to the first part of Table 2, which indicates that a significant
1D ambiguity is left in the solution of Eq. (11). Quality estimates in the
second part of Table 2 are comparable with those of our main results in Table 1.
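
The two alignment steps used in this subsection (a global shift and stretch, then a per-row mean correction) can be sketched as follows. This is a minimal sketch, assuming depth maps are stored as 2D NumPy arrays with image rows along the first axis; the function names are hypothetical and the quality measure itself is not reproduced here.

```python
import numpy as np

def match_mean_and_variance(estimated, reference):
    """Shift and stretch `estimated` so that its global mean and variance
    equal those of `reference`."""
    est_std = estimated.std()
    if est_std == 0:
        raise ValueError("estimated depth map is constant; cannot rescale")
    return (estimated - estimated.mean()) / est_std * reference.std() + reference.mean()

def match_row_means(estimated, reference):
    """Shift every row of `estimated` so that its mean equals the mean of
    the corresponding row of `reference`."""
    row_shift = (reference.mean(axis=1, keepdims=True)
                 - estimated.mean(axis=1, keepdims=True))
    return estimated + row_shift
```

The row-wise step removes one unknown offset per row, which is why it is a useful probe for a remaining one-dimensional ambiguity in the solution.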

5 Conclusions 

In this paper we have presented a successful combination of two previous facial
shape reconstruction approaches: one which uses symmetry and one which uses
statistics of human faces. Although our setup in this paper is rather restrictive
and results are inaccurate on many faces, our approach has major advantages
over the previous methods: it is very simple, provides a closed-form
solution, accounts for nonuniform facial albedo and has extremely low
computational complexity. The main disadvantage of the algorithm is inaccurate
results on some faces caused by asymmetry of these faces. On most faces, however,
we obtain reconstructions of sufficient quality for creation of realistic-looking
new synthetic views (see new geometry synthesis in Fig. 4 and new illumination
synthesis in Fig. 5). In general, synthesizing views with new illumination does
not require very accurate depth information, so our algorithm can be
considered appropriate for this application because of its simplicity and efficiency.

Fig. 4. Rows correspond to different Yale faces. The first, second and third columns
contain renderings of the faces with “ground truth”, reconstructed and average depths,
respectively. For the first three faces, the texture matches both the “ground truth”
and estimated depths rather well, but matches the average depth poorly (mainly in the
area between nose and mouth), indicating good shape estimation by our algorithm, at
least for these faces.



Acknowledgements. We are grateful to Prof. Sudeep Sarkar, University of
South Florida, for allowing us to use the USF DARPA HumanID 3D Face Database
for this research. Research was supported in part by the European Community
grant number IST-2000-26001 and by the Israel Science Foundation grant
number 266/02. The vision group at the Weizmann Inst. is supported in part by
the Moross Laboratory for Vision Research and Robotics.








Fig. 5. The four rows correspond to four different faces from the Yale face
database B. The first column contains renderings of faces with side illumination Az =
20° and El = 10°. The second column contains images rendered from the images in column
1 using the depth recovered by our algorithm. The faces in the second column should be
similar to the frontally illuminated faces in column 3 (one should ignore shadows present
in the rendered images, because such a simple cancellation scheme is not supposed to
cancel them out). Finally, the last column contains the frontally illuminated faces from
column 3, flipped around their vertical axis. By comparing the last two columns, we can
see noticeable facial asymmetry, even in the case of the three best faces. For the fourth
face the asymmetry is rather significant, especially the depth asymmetry near the nose,
causing rather big errors in the reconstructed depth, as can be seen in Fig. 4.

Abstract. We consider the problem of estimating the shape and
radiance of a scene from a calibrated set of images under the assumption 
that the scene is Lambertian and its radiance is piecewise constant. 
We model the radiance segmentation explicitly using smooth curves on 
the surface that bound regions of constant radiance. We pose the scene 
reconstruction problem in a variational framework, where the unknowns 
are the surface, the radiance values and the segmenting curves. We 
propose an iterative procedure to minimize a global cost functional that 
combines geometric priors on both the surface and the curves with a 
data fitness score. We carry out the numerical implementation in the 
level set framework. 

Keywords: variational methods, Mumford-Shah functional, image segmentation,
multi-view stereo, level set methods, curve evolution on manifolds.

Fig. 1. (COLOR) A plush model of “nemo.” The object exhibits piecewise constant
appearance. From a set of calibrated views, our algorithm can estimate the shape and 
the piecewise constant radiance. 



1 Introduction 

Inferring three-dimensional shape and appearance of a scene from a collection
of images has been a central problem in Computer Vision, known as multi-view
stereo. Traditional approaches to this problem first match points or small
image regions across different views and then combine the matches into a
three-dimensional model (footnote 1). Scene radiances can be reconstructed
afterwards if necessary. These approaches effectively avoid directly estimating
the scene radiances, which can be quite complex for real scenes. However, for
these methods to work, the scene has to satisfy quite restrictive assumptions,
namely the radiance has to contain “sufficient texture.” When the assumptions
are not fulfilled, traditional methods fail. Recently, various approaches have
been proposed to fill the gaps where the assumptions underlying traditional
stereo methods are violated and the scene radiance is assumed to be smooth, for
instance [1,2]. In this case, radiance is modeled explicitly rather than being
annihilated through image-to-image matching. The problem of reconstructing shape
and radiance is then formulated as a joint segmentation of all the images. The
resulting algorithms are very robust to image noise.

Fig. 2. Man-made objects often exhibit piecewise constant appearance. Approximating
their radiances with smooth functions would lead to gross error and “blurring” of the
reconstruction. On the other hand, these objects are not textured enough to establish
dense correspondence among different views. However, we can clearly see radiance
boundaries that divide the objects into constant regions.

However, there are certainly many scenes whose radiances are not heavily
textured, but are not smooth either. For instance, man-made objects are often
built with piecewise constant material properties and therefore exhibit
approximately piecewise constant radiances, such as those portrayed in Figure 2.
For scenes like these, the assumption of globally constant or smooth
radiances is not satisfied, nor are the radiances textured enough to establish
dense correspondence among different views.

For such scenes, one may attempt to use the algorithms designed for smooth 
radiances to reconstruct pieces of the surface that satisfy the assumptions and 
then patch them together to get the whole surface. Unfortunately, this approach 
does not work because patches are not closed surfaces, but even if they were, indi- 
vidual patches would not be able to explain the image data due to self-occlusions 
(see for instance Figure 4). Therefore, more complete and “global” models of the 
scene radiance are necessary. Our choice is to model it as a piecewise constant 
function. Under this choice, we can divide the scene into regions such that each 
region supports a uniform radiance, and the radiance is discontinuous across re- 
gions. The scene reconstruction problem now amounts to estimating the surface 
shape, the segmentation of the surface and the radiance value of each region. 



1 There are of course exceptions to this general approach, as we will discuss shortly. 







H. Jin, A.J. Yezzi, and S. Soatto 



1.1 Relation to Prior Work and Contributions 

This work falls in the category of multi-view stereo reconstruction. The literature 
on this topic is too large to review here, so we only report on closely related 
work. Faugeras and Keriven [3] were the first to combine image matching and 
shape reconstruction in a variational framework and carry out the numerical 
implementation using level set methods [4]. The underlying principle of their 
approach is still based on image-to-image matching and therefore their algorithm 
works for scenes that contain significant texture. Yezzi et al. [1] and Jin et al.
[2] approached the problem by explicitly modeling a (simplified) image formation
process and reconstructing both shape and radiance of the scene by matching
images to the underlying model, rather than to each other directly. The class
of scenes they considered is Lambertian with constant or smooth radiances. 
In this paper, we extend their work by allowing scenes to have discontinuous 
radiances and model explicitly the discontinuities. In the work of shape carving 
by Kutulakos and Seitz [5] , matching is based on the notion of photoconsistency, 
and the largest shape that is consistent with all the images is recovered. We use a 
different assumption, namely that the radiance is piecewise constant, to recover a 
different representation (the smoothest shape that is photometrically consistent 
with the data in a variational sense) as well as photometry. Since we estimate 
curves as radiance discontinuities, this work is related to stereo reconstruction of 
space curves [6,7]. The material presented in this paper is also closely related to 
a wealth of contributions in the field of image segmentation, particularly region- 
based segmentation, starting from Mumford and Shah’s pioneering work [8] and 
including [9,10]. 

We use curves on surfaces to model the discontinuities of the radiance. We 
use level set methods [4] to evolve both the surface and the curve to perform 
optimization. Our curve evolution scheme is closely related to [11,12,13]. 

We address the problem of multi-view stereo reconstruction for Lambertian 
objects that have piecewise constant radiances. To the best of our knowledge 
we are the first to address this problem. Our solution relies not on matching 
image-to-image, but on matching all images to the underlying model of both 
geometry and photometry. 

For scenes that satisfy the assumptions, we reconstruct (1) the shape of the 
scene (a collection of smooth surfaces) and the radiance of the scene, which 
includes (2) the segmentation of the scene radiance, defined by smooth curves, 
and (3) the radiance value of each region. Our implementation contains several 
novel aspects, including simultaneously evolving curves (radiance discontinuities) 
on evolving surfaces, both of which are represented by level set functions. 

2 A Variational Formulation 

We model the scene as a collection of smooth surfaces and a background. We
denote collectively with $S \subset \mathbb{R}^3$ all the surfaces, i.e., we allow $S$ to have
multiple connected components. We denote with $\mathbf{X} = [X, Y, Z]^T$ the coordinates
of a generic point on $S$ with respect to a fixed reference frame. We assume to be able
to measure $n$ images of the scene $I_i : \Omega_i \to \mathbb{R}$ (footnote 2), $i = 1, 2, \ldots, n$,
where $\Omega_i$ is the domain of each image with area element $d\Omega_i$. Each image is
obtained with a calibrated camera which, after pre-processing, can be modeled as an ideal
perspective projection $\pi_i : \mathbb{R}^3 \to \Omega_i$; $\mathbf{X} \mapsto \mathbf{x}_i = \pi_i(\mathbf{X}) = \pi(\mathbf{X}_i) = [X_i/Z_i, Y_i/Z_i]^T$,
where $\mathbf{X}_i = [X_i, Y_i, Z_i]^T$ are the coordinates for $\mathbf{X}$ in the $i$-th camera reference
frame. $\mathbf{X}$ and $\mathbf{X}_i$ are related by a rigid body transformation, which can be
represented in coordinates by a rotation matrix $R_i \in SO(3)$ (footnote 3) and a translation
vector $T_i \in \mathbb{R}^3$, so that $\mathbf{X}_i = R_i \mathbf{X} + T_i$. We assume that the background, denoted
with $B$, covers the field of view of every camera. Without loss of generality, we
assume $B$ to be a sphere with infinite radius, which can therefore be represented
using angular coordinates $\Theta \in \mathbb{R}^2$. We assume that the background supports a
radiance function $h : B \to \mathbb{R}$ and the surface supports another radiance function
$\rho : S \to \mathbb{R}$. We define the region $Q_i = \pi_i(S) \subset \Omega_i$ and denote its complement by
$Q_i^c = \Omega_i \setminus Q_i$.
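
The camera model above (a rigid motion followed by an ideal perspective projection) can be illustrated with a short NumPy sketch. The function name `project` and the check that the point lies in front of the camera are assumptions added here for illustration; the paper itself only states the mapping.

```python
import numpy as np

def project(X, R, T):
    """Ideal perspective projection of a 3D point.

    X is a point in the fixed reference frame (shape (3,)), R a 3x3
    rotation matrix and T a translation vector (shape (3,)).
    Returns the image coordinates [X_i / Z_i, Y_i / Z_i].
    """
    X_cam = R @ X + T  # coordinates in the i-th camera frame
    if X_cam[2] <= 0:
        raise ValueError("point is not in front of the camera")
    return X_cam[:2] / X_cam[2]
```

For instance, with `R` the identity and `T = [0, 0, 2]`, the point `[1, 2, 2]` has camera coordinates `[1, 2, 4]` and projects to `[0.25, 0.5]`.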

Our assumption is that the foreground radiance $\rho$ is a piecewise constant
function. We refer the reader to [14] for an extension to piecewise smooth radiances.
For simplicity, the background radiance $h$ is assumed to be constant, although
extensions to smooth, piecewise constant or piecewise smooth functions can be
conceived. Furthermore, we assume that the discontinuities of $\rho$ can be modeled
as a smooth closed curve $C$ on the surface $S$, and $C$ partitions $S$ into two
regions $D_1$ and $D_2$ such that $D_1 \cup D_2 = S$. Note that we allow each region
$D_i$ to have multiple disconnected components. Extensions to more regions are
straightforward, for instance following the work of Vese and Chan [15]. We can
thus re-define $\rho$ as follows:

$$ \rho(\mathbf{X}) = \rho_i \in \mathbb{R} \quad \text{for } \mathbf{X} \in D_i, \; i = 1, 2. \tag{1} $$

We denote with $\pi_i(D_1)$ and with $\pi_i(D_2)$ the projections of $D_1$ and $D_2$ in the
$i$-th image respectively.

2.1 The Cost Functional 

The task is to reconstruct $S$, $C$, $\rho_1$, $\rho_2$, and $h$ from the data $I_i$, $i = 1, 2, \ldots, n$. In
order to do so, we set up a cost, $E_{data}$, that measures the discrepancy between the
prediction of the unknowns and the actual measurements. Since some unknowns,
namely the surface $S$ and the curve $C$, live in infinite-dimensional spaces, we need
to impose regularization to make the inference problem well-posed. In particular,
we assume that both the surface and the curve are smooth (geometric priors),
and we leverage our assumption that the radiance is constant within each
domain. The final cost is therefore the sum of three terms:

$$ E(S, C, \rho_1, \rho_2, h) = E_{data} + \alpha E_{surf} + \beta E_{curv}, \tag{2} $$

2 More precisely, measured images are usually non-negative discrete functions defined 
on a discrete grid and have minimum and maximum values. For ease of notation, we 
will consider them to be defined on continuous domains and take values from the 
whole real line. 

3 $SO(3) = \{R \mid R \in \mathbb{R}^{3 \times 3} \text{ s.t. } R^T R = I \text{ and } \det(R) = 1\}$.

where $\alpha, \beta \in \mathbb{R}^+$ control the relative weights among the terms. The data fitness
can be measured in the sense of $L^2$ as:

$$ E_{data} = \sum_{i=1}^{n} \left( \int_{\pi_i(D_1)} (I_i(\mathbf{x}_i) - \rho_1)^2 \, d\Omega_i + \int_{\pi_i(D_2)} (I_i(\mathbf{x}_i) - \rho_2)^2 \, d\Omega_i \right) + \sum_{i=1}^{n} \int_{Q_i^c} (I_i(\mathbf{x}_i) - h)^2 \, d\Omega_i, \tag{3} $$
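
On a pixel grid, the data term in (3) reduces to sums of squared residuals over three image regions. The sketch below is a minimal discretization under assumptions that are not in the paper: each pixel is given unit area, and the projections of $D_1$ and $D_2$ are supplied as hypothetical boolean masks `masks_d1` and `masks_d2` (the remaining pixels are treated as background).

```python
import numpy as np

def data_cost(images, masks_d1, masks_d2, rho1, rho2, h):
    """Discrete version of the data fitness term.

    images: list of 2D float arrays, one per view.
    masks_d1, masks_d2: lists of boolean arrays marking the pixels covered
    by the projections of the two foreground regions in each view.
    rho1, rho2, h: constant radiances of the two regions and the background.
    """
    total = 0.0
    for image, m1, m2 in zip(images, masks_d1, masks_d2):
        background = ~(m1 | m2)  # complement of the foreground projection
        total += np.sum((image[m1] - rho1) ** 2)
        total += np.sum((image[m2] - rho2) ** 2)
        total += np.sum((image[background] - h) ** 2)
    return float(total)
```

In the actual method the masks would follow from projecting the current surface and curve with visibility taken into account; that step is outside the scope of this sketch.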



although other function norms would do as well. The geometric prior for $S$ is
given by the total surface area:

$$ E_{surf} = \int_S dA, \tag{4} $$

and that for $C$ is given by the total curve length:

$$ E_{curv} = \int_C ds, \tag{5} $$

where $dA$ is the Euclidean area form of $S$ and $s$ is the arc-length parameterization
for $C$. Therefore, the total cost takes the expression:

$$ E_{total}(S, C, \rho_1, \rho_2, h) = E_{data} + \alpha E_{surf} + \beta E_{curv} = \sum_{i=1}^{n} \left( \int_{\pi_i(D_1)} (I_i(\mathbf{x}_i) - \rho_1)^2 \, d\Omega_i + \int_{\pi_i(D_2)} (I_i(\mathbf{x}_i) - \rho_2)^2 \, d\Omega_i \right) + \sum_{i=1}^{n} \int_{Q_i^c} (I_i(\mathbf{x}_i) - h)^2 \, d\Omega_i + \alpha \int_S dA + \beta \int_C ds. \tag{6} $$

This functional is in the spirit of the Mumford-Shah functional for image
segmentation [8].



3 Optimization of the Cost Functional 

In order to find the surface $S$, the radiances $\rho_1, \rho_2, h$ and the curve $C$ that
minimize the cost (6), we set up an iterative procedure where we start from a
generic initial condition (typically a big cube, sphere or cylinder) and update
the unknowns along their gradient directions until convergence to a (necessarily
local) minimum.

3.1 Updating the Surface 

The gradient descent flow for the surface geometric prior is given by $S_t = 2\kappa N$,
where $\kappa$ is the mean curvature and $N$ is the unit normal to $S$. Note that we have
kept the factor 2 in the expression in order to have the weights in the final flow
match the weights in the cost (2). To facilitate computing the variation of the
remaining terms with respect to the surface, we introduce the radiance characteristic
function $\phi$ to describe the location of $C$ for a given surface $S$. We define
$\phi : S \to \mathbb{R}$ such that

$$ D_1 = \{\mathbf{X} \mid \phi(\mathbf{X}) > 0\}, \quad D_2 = \{\mathbf{X} \mid \phi(\mathbf{X}) < 0\}, \quad \text{and} \quad C = \{\mathbf{X} \mid \phi(\mathbf{X}) = 0\}. \tag{7} $$

$\phi$ can be viewed as the level set function of $C$. However, one has to keep in
mind that $\phi$ is defined on $S$. We can then express the curve length as
$\int_C ds = \int_S \|\nabla_S \mathcal{H}(\phi)\| \, dA$, where $\mathcal{H}$ is the Heaviside function. We prove in the
technical report [14] that the gradient descent flow for the curve smoothness term
has the following expression

$$ S_t = \frac{\delta(\phi)}{\|\nabla_S \phi\|} \, II(\nabla_S \phi \times N) \, N, \tag{8} $$



where $II(\mathbf{t})$ denotes the second fundamental form of a vector $\mathbf{t} \in T_P(S)$, i.e.,
the normal curvature along $\mathbf{t}$ for $\|\mathbf{t}\| = 1$, and $\delta$ denotes the one-dimensional
Dirac distribution: $\delta = \dot{\mathcal{H}}$. $T_P(S)$ is the tangent space for $S$ at $P$. Note that
$\nabla_S \phi \times N \perp N$ and therefore $\nabla_S \phi \times N \in T_P(S)$. Since flow (8) involves $\delta(\phi)$, it
only acts on the places where $\phi$ is zero, i.e., the curve $C$. To find the variation of
the data fitness term with respect to $S$, we need to introduce two more terms. Let
$\chi_i : S \to \mathbb{R}$ be the surface visibility function with respect to the $i$-th camera, i.e.,
$\chi_i(\mathbf{X}) = 1$ for points on $S$ that are visible from the $i$-th camera and $\chi_i(\mathbf{X}) = 0$
otherwise. Let $\sigma_i$ be the change of coordinates from $d\Omega_i$ to $dA$, i.e.,
$\sigma_i = d\Omega_i / dA = \langle \mathbf{X}_i, N_i \rangle / Z_i^3$, where $N_i$ is the unit normal $N$ expressed in the
$i$-th camera reference frame. We can now express the data term as follows (see the
technical report [14] for more details):

$$ \sum_{i=1}^{n} \left( \int_{\pi_i(D_1)} (I_i(\mathbf{x}_i) - \rho_1)^2 \, d\Omega_i + \int_{\pi_i(D_2)} (I_i(\mathbf{x}_i) - \rho_2)^2 \, d\Omega_i + \int_{Q_i^c} (I_i(\mathbf{x}_i) - h)^2 \, d\Omega_i \right) = \sum_{i=1}^{n} \int_S \chi_i \Gamma_i \sigma_i \, dA + \sum_{i=1}^{n} \int_{\Omega_i} (I_i(\mathbf{x}_i) - h)^2 \, d\Omega_i, \tag{9} $$

where $\Gamma_i = \mathcal{H}(\phi)(I_i - \rho_1)^2 + (1 - \mathcal{H}(\phi))(I_i - \rho_2)^2 - (I_i - h)^2$. Note that we have
dropped the arguments for $I_i$ and $\phi$ for ease of notation. Since, for a fixed $h$,
$\sum_{i=1}^{n} \int_{\Omega_i} (I_i(\mathbf{x}_i) - h)^2 \, d\Omega_i$ does not depend upon the unknown surface, we only
need to compute the variation of the first term $\sum_{i=1}^{n} \int_S \chi_i \Gamma_i \sigma_i \, dA$ with respect
to $S$. We prove in the technical report [14] that the gradient descent flow for
minimizing cost functionals of the general type $\sum_{i=1}^{n} \int_S \chi_i \Gamma_i \sigma_i \, dA$ takes the form:

$$ S_t = \sum_{i=1}^{n} \frac{1}{Z_i^3} \left( \Gamma_i \langle \chi_{iX}, R_i^T \mathbf{X}_i \rangle - \chi_i \langle \Gamma_{iX}, R_i^T \mathbf{X}_i \rangle \right) N, \tag{10} $$



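where $\chi_{iX}$ and $\Gamma_{iX}$ denote the derivatives of $\chi_i$ and $\Gamma_i$ with respect to $\mathbf{X}$
respectively. We further note that $\langle I_{iX}, R_i^T \mathbf{X}_i \rangle = 0$ [16] and obtain

$$ \langle \Gamma_{iX}, R_i^T \mathbf{X}_i \rangle = \delta(\phi) \left( (I_i - \rho_1)^2 - (I_i - \rho_2)^2 \right) \langle \nabla_S \phi, R_i^T \mathbf{X}_i \rangle. \tag{11} $$

Therefore, the whole gradient descent flow for the cost (6) is given by

$$ S_t = \left( \sum_{i=1}^{n} \frac{1}{Z_i^3} \left( \Gamma_i \langle \chi_{iX}, R_i^T \mathbf{X}_i \rangle - \chi_i \, \delta(\phi) \left( (I_i - \rho_1)^2 - (I_i - \rho_2)^2 \right) \langle \nabla_S \phi, R_i^T \mathbf{X}_i \rangle \right) + 2\alpha\kappa + \beta \frac{\delta(\phi)}{\|\nabla_S \phi\|} \, II(\nabla_S \phi \times N) \right) N. \tag{12} $$

Note that flow (12) depends only upon the image values, not the image gradients.
This property greatly improves the robustness of the resulting algorithm to image
noise when compared to other variational approaches [3] to stereo based on
image-to-image matching (i.e., it is less prone to becoming “trapped” in local minima).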



3.2 Updating the Curve 

We show in the technical report [14] that the gradient descent flow for $C$ related
to the smoothness of the curve is given by

$$ C_t = \left( \sum_{i=1}^{n} \left( (I_i - \rho_2)^2 - (I_i - \rho_1)^2 \right) \sigma_i + \beta \kappa_g \right) \mathbf{n}, \tag{13} $$

where $\kappa_g$ is the geodesic curvature and $\mathbf{n}$ is the normal to the curve in $T_P(S)$
(commonly referred to as the intrinsic normal to the curve $C$). Since $\mathbf{n} \in T_P(S)$,
$C$ stays in $S$ as it evolves according to equation (13).



3.3 Updating the Radiances 

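Finally, the optimization with respect to the radiances can be solved in closed
form as:

$$ \rho_1 = \frac{\sum_{i=1}^{n} \int_{\pi_i(D_1)} I_i(\mathbf{x}_i) \, d\Omega_i}{\sum_{i=1}^{n} \int_{\pi_i(D_1)} d\Omega_i}, \quad \rho_2 = \frac{\sum_{i=1}^{n} \int_{\pi_i(D_2)} I_i(\mathbf{x}_i) \, d\Omega_i}{\sum_{i=1}^{n} \int_{\pi_i(D_2)} d\Omega_i}, \quad h = \frac{\sum_{i=1}^{n} \int_{Q_i^c} I_i(\mathbf{x}_i) \, d\Omega_i}{\sum_{i=1}^{n} \int_{Q_i^c} d\Omega_i}, \tag{14} $$

i.e., the optimal values are the sample averages of the intensity values in the
corresponding regions.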
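
The closed-form update can be sketched as pooled averages over all views. As before, this is an illustrative discretization, not the authors' code: pixels are assumed to have unit area, the region projections are passed in as hypothetical boolean masks, and every region is assumed to be non-empty in at least one view.

```python
import numpy as np

def update_radiances(images, masks_d1, masks_d2):
    """Return (rho1, rho2, h): the average intensity over all views of the
    pixels in the projection of D1, of D2, and of the background."""
    sums = np.zeros(3)
    areas = np.zeros(3)
    for image, m1, m2 in zip(images, masks_d1, masks_d2):
        regions = (m1, m2, ~(m1 | m2))  # D1, D2, background
        for j, region in enumerate(regions):
            sums[j] += image[region].sum()
            areas[j] += region.sum()
    rho1, rho2, h = sums / areas  # assumes no region is empty overall
    return rho1, rho2, h
```

Note that the averages are pooled over all views before dividing, rather than averaged per view, which is what the ratio of sums in the closed-form expression prescribes.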



4 A Few Words on the Numerical Implementation 

In this section, we report some details on implementing the proposed algorithm.
Both the surface and curve evolutions are carried out in the level set framework
[4]. Since there has been a lot of work on shape reconstruction using level set
methods and space is limited, we refer the interested reader to [2,3,17] for
general issues. We would like to point out that we do not include the term (8)
in the implementation of the surface evolution, because experiments have
shown that convincing results can be obtained even when this term is neglected,
given its localized influence only near the segmenting curve. The numerical
implementation of equations (14) should also be straightforward, since one only
needs to compute the sample average of the intensities in the regions $\pi_i(D_1)$,
$\pi_i(D_2)$ and $Q_i^c$. Therefore, we will devote the rest of this section to issues related
to the implementation of the flow (13).

Fig. 3. (COLOR) The left 2 images are 2 out of 26 views of a synthetic scene. The scene
consists of two spheres, each of which is painted in black with the word “ECCV”. The
rest of each sphere is white and the background is gray. Each image is of size 257 x 257.
The right 2 images are 2 out of 31 images from the “nemo” dataset. Each image is of
size 335 x 315 and calibrated manually using a calibration rig.

Note that flow (13) is not a simple planar curve evolution. The curve is
defined on the unknown surface, and therefore its motion has to be constrained
to the surface. (It does not make sense to move the curve freely in $\mathbb{R}^3$, which
would lead the curve out of the surface.) The way we approach the problem is
to exploit the radiance characteristic function $\phi$. Our approach is similar to the
one considered by [11,12]. Recall that $C$ is the zero level set of $\phi$. We can express all
the terms related to $C$ using $\phi$. In particular, the geodesic curvature is given by
(we refer the reader to [14] for details on deriving this expression and the remaining
equations in this section):

$$ \kappa_g = \nabla_S \cdot \left( \frac{\nabla_S \phi}{\|\nabla_S \phi\|} \right) = \frac{\Delta_S \phi}{\|\nabla_S \phi\|} - \frac{\langle \nabla_S \phi, \nabla_S(\|\nabla_S \phi\|) \rangle}{\|\nabla_S \phi\|^2} = \frac{\Delta_S \phi}{\|\nabla_S \phi\|} - \frac{\nabla_S \phi^T \, \nabla_S^2 \phi \, \nabla_S \phi}{\|\nabla_S \phi\|^3}, \tag{15} $$

where $\nabla_S^2 \phi$ and $\Delta_S \phi$ denote the intrinsic Hessian and the intrinsic Laplacian of
$\phi$ respectively. After representing the curve $C$ with $\phi$, we can implement the
curve evolution by evolving the function $\phi$ on the surface. We further relax
(footnote 4) $\phi$ from being a function defined on $S$ to being a function defined on
$\mathbb{R}^3$. This is related to the work of [13] for smoothing functions on surfaces. We
denote with $\varphi$ the extended function. We can then express the intrinsic gradient
as follows:

$$ \nabla_S \phi = \nabla \varphi - \langle \nabla \varphi, N \rangle N, \tag{16} $$

and the intrinsic Hessian as follows:

$$ \nabla_S^2 \phi = (I - NN^T) \nabla^2 \varphi (I - NN^T) - (N^T \nabla \varphi) \frac{(I - NN^T) \nabla^2 \psi (I - NN^T)}{\|\nabla \psi\|}, \tag{17} $$

where $\nabla^2$ stands for the standard Hessian in space and $\psi$ is the level set function
for $S$. $\Delta_S \phi$ can be computed as

$$ \Delta_S \phi = \mathrm{trace}(\nabla_S^2 \phi) = \Delta \varphi - 2\kappa N^T \nabla \varphi - N^T \nabla^2 \varphi N. \tag{18} $$

Finally, the curve evolution (13) is implemented by updating the following partial
differential equation

$$ \phi_t = \|\nabla_S \phi\| \sum_{i=1}^{n} \left( (I_i - \rho_2)^2 - (I_i - \rho_1)^2 \right) \sigma_i + \beta \left( \Delta_S \phi - \frac{\nabla_S \phi^T \, \nabla_S^2 \phi \, \nabla_S \phi}{\|\nabla_S \phi\|^2} \right), \tag{19} $$

with $\phi$ replaced by $\varphi$ and $\nabla_S \phi$, $\nabla_S^2 \phi$ and $\Delta_S \phi$ replaced by the corresponding
terms of $\varphi$ according to equations (16), (17) and (18).

4 This relaxation does not necessarily have to cover the entire $\mathbb{R}^3$. It only needs to
cover the regions where the numerical computation operates.
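
The expressions for the intrinsic gradient and intrinsic Laplacian in terms of an extended function can be checked pointwise with a small NumPy sketch. The function names are hypothetical, and the sketch assumes the convention implied by the Laplacian formula above, namely that the divergence of the unit normal equals twice the mean curvature (so the unit sphere with outward normal has mean curvature 1).

```python
import numpy as np

def intrinsic_gradient(grad_ext, normal):
    """Project the spatial gradient of the extended function onto the
    tangent plane with unit normal `normal`."""
    return grad_ext - np.dot(grad_ext, normal) * normal

def intrinsic_laplacian(grad_ext, hess_ext, normal, kappa):
    """Intrinsic Laplacian from the spatial gradient and Hessian of the
    extended function, the unit normal and the mean curvature `kappa`."""
    return (np.trace(hess_ext)
            - 2.0 * kappa * np.dot(normal, grad_ext)
            - normal @ hess_ext @ normal)
```

As a sanity check, take the unit sphere and the extended function equal to the coordinate $z$. Its restriction to the sphere is a first-order spherical harmonic, whose surface Laplacian is $-2z$; at the point $(1, 2, 2)/3$ the second function returns $-4/3$, in agreement.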

5 Experiments 

In Figure 3 (left 2 images) we show 2 out of 26 views of a synthetic scene, which
consists of two spheres. Each image is of size 257 x 257. Each sphere is painted
in black with the word “ECCV” and the rest is white. The background is gray.
Clearly, modeling this scene with one single constant radiance would lead to gross
errors. One cannot even reconstruct either the white or the black part using the
smooth radiance model in [1,2], due to occlusions. For comparison purposes, we report
the results of our implementation of [1] in Figure 4 (the right 2 images). The
left 4 images in Figure 4 show the final reconstructed shape using the proposed
algorithm. The red curve marks the discontinuities of the radiance. The
explicit modeling of radiance discontinuities may enable further applications.
For instance, one can flatten the surface and the curve and perform character
recognition of the letters. The numerical grid used in both algorithms is the
same and of size 128 x 128 x 128. In Figure 5 we show the surface evolving
from a large cylinder to a final solid model. The foreground in all the images
is rendered with its estimated radiance values ($\rho_1$ and $\rho_2$) and the segmenting
curve is rendered in red. In Figure 6 we show the images reconstructed using
the estimated surface, radiances and segmenting curve compared with an actual
image from the original dataset.

In Figure 3 (right 2 images) we show 2 out of 31 views of a real scene,
which contains a plush model of “nemo”. The intrinsics and extrinsics of all the
images are calibrated off-line. Each image is of size 335 x 315. Nemo is red with
white stripes. For the proposed algorithm to work with color images, we have
extended the model (6) as follows: we consider images to take vector values (RGB
color in our case) and replace the squared error between scalars in equation (6)
with the squared Euclidean norm of the vector difference. In Figure 7 we show
several shaded views of the final reconstructed shape using the proposed algorithm.
The radiance discontinuities are rendered as green curves. The numerical grid used
here is of size 128 x 60 x 100. In Figure 8 we show the surface evolving from
an initial shape that neither contains nor is contained in the shape of the scene,
to a final solid model. The foreground in all the images is rendered with its
estimated radiance values ($\rho_1$ and $\rho_2$) and the segmenting curve is rendered in
green. In Figure 9 we show the images reconstructed using the estimated surface,
radiances and segmenting curve compared with one actual image in the original
dataset.

Fig. 4. (COLOR) The first 4 images are shaded views of the final shape estimated using
the proposed algorithm. Radiance discontinuities have been rendered as red curves.
The location of the radiance discontinuities can be exploited for further purposes, for
instance character recognition. The last 2 images are the results of assuming that the
foreground has constant radiance, as in [1]. Note that the algorithm of [1] cannot
capture all the white parts or all the black parts of the spheres, because that is not
consistent with the input images due to occlusions.

Fig. 5. (COLOR) Rendered surface during evolution. The foreground in all the images
is rendered with the current estimate of the radiance ($\rho_1$ and $\rho_2$) plus some shading
effects for ease of visualization.

Fig. 6. The first image is just one view from the original dataset. The remaining 6
images are rendered using estimates from different stages of the estimation process. In
particular, the second image is rendered using the initial data and the last image is
rendered using the final estimates.

Fig. 7. (COLOR) Several shaded views of the final reconstructed surface. The radiance
discontinuities have been highlighted in green.

Fig. 8. (COLOR) Rendered surface during evolution. Notice that the initial surface
neither contains nor is contained in the actual object. The foreground in all the images
is rendered with the current estimate of the radiance values ($\rho_1$ and $\rho_2$) plus some
shading effects for ease of visualization.

Fig. 9. (COLOR) The first image is just one view from the original data set. The
remaining 6 images are rendered using estimates from different stages of the estimation
process. In particular, the second image is rendered using the initial data and the last
image is rendered using the final estimates.



6 Conclusions 

We have presented an algorithm to reconstruct the shape and radiance of a 
Lambertian scene with piecewise constant radiance from a collection of cali- 
brated views. We set the problem in a variational framework and minimize a 
cost functional with respect to the unknown shape, unknown radiance values 
in each region, and unknown radiance discontinuities. We use gradient-descent 
partial differential equations to simultaneously evolve a surface in space (shape), 
a curve defined on the surface (radiance discontinuities) and radiance values of 
each region, which are implemented numerically using level set methods.

Abstract. Estimating human pose in static images is challenging due
to the high dimensional state space, presence of image clutter and
ambiguities of image observations. We present an MCMC framework for
estimating 3D human upper body pose. A generative model, comprising
the human articulated structure, shape and clothing models, is used
to formulate likelihood measures for evaluating solution candidates. We
adopt a data-driven proposal mechanism for searching the solution space 
efficiently. We introduce the use of proposal maps, which is an efficient 
way of implementing inference proposals derived from multiple types of 
image cues. Qualitative and quantitative results show that the technique 
is effective in estimating 3D body pose over a variety of images. 



1 Estimating Pose in Static Image 

This paper proposes a technique for estimating human upper body pose in static 
images. Specifically, we want to estimate the 3D body configuration defined by 
a set of parameters that represent the global orientation of the body and body 
joint angles. We are focusing on middle resolution images, where a person’s
upper body length is about 100 pixels or more. Images of people in meetings or
other indoor environments are usually of this resolution. We are currently only
concerned with estimating the upper body pose, which is relevant for indoor
scenes. In this situation the lower body is often occluded and the upper body
conveys most of a person’s gestures. We do not make any restrictive assumptions
about the background or the human shape and clothing, except that the person
wears neither headwear nor gloves.



1.1 Issues 

There are two main issues in pose estimation with static images: the
high-dimensional state space and pose ambiguity.



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3022, pp. 126-138, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

High Dimension State Space. Human upper body pose has about 20
parameters, so pose estimation involves searching a high dimensional space
with a complex distribution. With static images, there is no preceding pose for
initializing the search, unlike in a video tracking problem. This calls for an
efficient mechanism for exploring the solution space. In particular, the search is
preferably data-driven, so that good solution candidates can be found easily.



Pose Ambiguity. From a single view, the inherent non-observability of some of 
the degrees of freedom in the body model leads to forwards/backwards flipping 
ambiguities [10] of the depth positions of body joints. Ambiguity is also caused 
by noisy and false observations. This problem can be partly alleviated by using 
multiple image cues to achieve robustness. 

1.2 Related Work 

Pose estimation on video has been addressed in many previous works, either 
using multiple cameras [3] or a single camera [2,9]. Many of these works used the 
particle filter approach to estimate the body pose over time, by relying on a good 
initialization and temporal smoothness. An observation-based importance sampling
scheme has also been integrated into this approach to improve robustness and
efficiency [5].

For static images, some works have been reported for recognizing prototype 
body poses using shape context descriptors and exemplars [6]. Another related 
work involves the mapping of image features into body configurations [8]. These
works, however, rely either on a clean background or on the human being segmented
by background subtraction, and are therefore not suitable for fully automatic pose
estimation in static images.

Various reported efforts were dedicated to the detection and localization of 
body parts in images. In [4,7], the authors modeled the appearance and the 
2D geometric configuration of body parts. These methods focus on real-time 
detection of people and do not estimate the 3D body pose. Recovering 3D pose 
was studied in [1,11], but the proposed methods assume that image positions of 
body joints are known and therefore tremendously simplify the problem. 



2 Proposed Approach 

We propose to address this problem by building an image generative model and
using the MCMC framework to search the solution space. The image generative
model consists of (i) a human model, which encompasses the articulated structure,
shape and the type of clothing, (ii) scene-to-image projection, and (iii)
generation of image features. The objective is to find the human pose that maximizes
the posterior probability.

We use the MCMC technique to sample the complex solution space. The set
of solution samples generated by the Markov chain weakly converges to a
stationary distribution equivalent to the posterior distribution. The data-driven MCMC
framework [13] allows us to design good proposal functions, derived from image
observations. These observations include the face, the head-shoulder contour, and skin
color blobs. The observations, weighted according to their saliency, are used to
generate proposal maps that represent the proposal distributions of the image
positions of body joints. These maps are first used to infer solutions on a set of
2D pose variables, and subsequently to generate proposals on the 3D pose using
inverse kinematics. The proposal maps considerably improve the estimation by
consolidating the evidence provided by different image cues.

3 Human Model 

3.1 Pose Model 

This model represents the articulated structure of the human body and the
degrees of freedom in human kinematics. The upper body consists of 7 joints, 10
body parts and 21 degrees of freedom (6 for global orientation and 15 for joint
angles). We assume an orthographic projection and use a scale parameter to
represent the person's height.

3.2 Probabilistic Shape Model 

The shape of each body part is approximated by a truncated 3D cone. Each cone 
has three free parameters: the length of the cone and the widths of the top and 
base of the cone. The aspect ratio of the cross section is assumed to be constant. 
Some of the cones share common widths at the connecting joints. In total, there 
are 16 shape parameters. As some of the parameters have small variances and 
some are highly correlated, the shape space is reduced to 6 dimensions using 
PCA, and this accounts for 95% of the shape variation in the training data set. 
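
The dimensionality reduction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `fit_shape_pca` and `project_shape` are hypothetical, and the rule "keep the smallest number of components reaching a target explained variance" is an assumption about how the 6 dimensions and the 95% figure relate.

```python
import numpy as np

def fit_shape_pca(shapes, target_variance=0.95):
    """Fit PCA to shape vectors (hypothetical helper, for illustration).

    shapes: (n_samples, n_params) array, e.g. 16 shape parameters per person.
    Returns (mean, basis, k), where basis holds the top-k principal axes and
    k is the smallest count whose cumulative explained variance reaches
    target_variance.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # Squared singular values are proportional to the variance captured
    # along each principal axis.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    explained = sing**2 / np.sum(sing**2)
    k = int(np.searchsorted(np.cumsum(explained), target_variance) + 1)
    k = min(k, len(sing))  # guard against floating-point round-off
    return mean, vt[:k], k

def project_shape(s, mean, basis):
    """Map a full shape vector s to its reduced representation s'."""
    return basis @ (s - mean)
```

A Gaussian prior on the reduced vector s' can then be estimated from the projected training samples.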

3.3 Clothing Model 

This model describes the person’s clothing so that we can hypothesize where the
skin is visible, and observed skin color features can be interpreted correctly.
As we are only concerned with the upper body, we use a simple model with
only one parameter that describes the length of the sleeve. For efficiency, we
quantize this parameter into five discrete levels, as shown in Figure 1a.

4 Prior Model 

We denote the state variable as m, which consists of four subsets: (i) global
orientation parameters: g, (ii) local joint angles: j, (iii) human shape parameters:
s, and (iv) clothing parameter: c.

$$m = \{g, j, s, c\}. \qquad (1)$$

Assuming that the subsets of parameters are independent, the prior distribution
of the state variable is given by:

$$p(m) \approx p(g)\, p(j)\, p(s)\, p(c). \qquad (2)$$

Global Orientation Parameters. The global orientation parameters consist
of image position ($x_g$), rotation parameters ($r_g$) and a scale parameter ($h_g$). We
assume these parameters to be independent so that the following property holds:

$$p(g) \approx p(x_g)\, p(r_g)\, p(h_g). \qquad (3)$$

The prior distributions are modeled as normal distributions and learned from 
training data. 



Joint Angles Parameters. The subset j consists of 15 parameters describing
the joint angles at 7 different body joint locations.

$$j = \{ j_i \mid i = \text{neck}, \text{left\_wrist}, \text{left\_elbow}, \ldots, \text{right\_shoulder} \}. \qquad (4)$$

In general, the joint angles are not independent. However, it is impracticable to 
learn the joint distribution of the 15 dimensional j vector, with a limited training 
data set. As an approximation, our prior model consists of joint distribution of 
pair-wise neighboring body joint locations. For each body location, i, we specify 
a neighboring body location as its parent, where: 

parent(left_wrist) = left_elbow        parent(right_wrist) = right_elbow
parent(left_elbow) = left_shoulder     parent(right_elbow) = right_shoulder
parent(left_shoulder) = torso          parent(right_shoulder) = torso
parent(neck) = torso                   parent(torso) = ∅

The prior distribution is then approximated as:

$$p(j) \approx \lambda_{pose} \prod_i p(j_i) + (1 - \lambda_{pose}) \prod_i p(j_i, j_{parent(i)}) \qquad (5)$$

where $\lambda_{pose}$ is a constant valued between 0 and 1. The prior distributions $p(j_i)$
and $p(j_i, j_{parent(i)})$ are modeled as Gaussians. The constant $\lambda_{pose}$ is
estimated from training data using cross-validation, based on the maximum likelihood
principle.



Shape Parameters. PCA is used to reduce the dimensionality of the shape
space by transforming the variable s to a 6-dimensional variable s', and the prior
distribution is approximated by a Gaussian:

$$p(s) \approx p(s') \approx N(s', \mu_{s'}, \Sigma_{s'}) \qquad (6)$$

where $\mu_{s'}$ and $\Sigma_{s'}$ are the mean and covariance matrix of the prior distribution
of s'.



Clothing Parameters. The clothing model consists of a discrete variable c, 
representing the sleeve length. The prior distribution is based on the empirical 
frequency in the training data. 

Marginalized distribution of image positions of body joints. We denote
$\{u_i\}$ as the set of image positions of body joints. Given a set of parameters
$\{g, j, s\}$, we are able to compute the image position of each body joint $u_i$:

$$u_i = f_i(g, j, s) \qquad (7)$$

where $f_i(\cdot)$ is a deterministic forward kinematic function. Therefore, there exists
a prior distribution for each image position:

$$p(u_i) = \iiint \delta\big(u_i - f_i(g, j, s)\big)\, p(g)\, p(j)\, p(s)\, dg\, dj\, ds \qquad (8)$$



where $p(u_i)$ represents the marginalized prior distribution of the image position
of the i-th body joint. In fact, any variable that is derived from image positions
of the body joints has a prior distribution, such as the lengths of the arms in 
the image or the joint positions of the hand and elbow. As will be described 
later, these prior distributions are useful in computing weights for the image 
observations. The prior distribution of these measures can be computed from 
Equation (8) or it can be learned directly from the training data as was performed 
in our implementation. 



5 Image Observations 

Image observations are used to compute data-driven proposal distributions in the
MCMC framework. The extraction of observations consists of 3 stages: (i) face
detection, (ii) head-shoulders contour matching, and (iii) skin blob detection.



5.1 Face Detection 

For face detection, we use the AdaBoost technique proposed in [12]. We denote
the face detection output as a set of face candidates,

$$I_{Face} = \{ I_{Face\_Position}, I_{Face\_Size} \}, \qquad (9)$$

where $I_{Face\_Position}$ is the detected face location and $I_{Face\_Size}$ is the estimated
face size. The observation can be used to provide a proposal distribution for the
image head position, $u_{Head}$, modeled as a Gaussian distribution:

$$q(u_{Head} \mid I_{Face}) \sim N(u_{Head} - I_{Face\_Position}, \cdot, \cdot). \qquad (10)$$

The parameters of the Gaussian are estimated from training data. The above 
expression can be extended to handle multiple detected faces. 

5.2 Head-Shoulder Contour Matching 

Contour Model for Head-Shoulder. We are interested in detecting the 2D
contour of the head and shoulders. Each contour is represented by a set of connected
points. This contour is pose and person dependent. For robustness, we use a
mixture model approach to represent the distribution of the 2D contour space. Using
a set of 100 training samples, a K-means clustering algorithm is used to learn the
means of 8 components, as shown in Figure 1b. The joint distributions of these
contours and the image positions of the head, neck and shoulders are also learned from
the training data.





Fig. 1 . Models: (a) Quantized sleeve length of clothing model, (b) components of head- 
shoulder model. 



Contour Matching. In each test image, we extract edges using the Canny 
detector, and a gradient descent approach is used to align each exemplar contour 
to these edges. We define a search window around a detected face, and initiate 
searches at different positions within this window. This typically results in about 
200 contour candidates. The confidence of each candidate is weighted based
on (i) the confidence weight of the detected face, (ii) the joint probability of contour
position and detected face position, and (iii) the edge alignment error. The number
of candidates is reduced to about 50 by removing those with low confidence.

The resulting output is a set of matched contours $\{I_{Head\_Shoulder,i}\}$. Each
contour provides observations on the image positions of the head, neck, left
shoulder and right shoulder, with a confidence weight $w_{HS,i}$:

$$I_{Head\_Shoulder,i} = \{ w_{HS,i},\, I_{Head\_Pos,i},\, I_{Neck\_Pos,i},\, I_{L\_Shoulder\_Pos,i},\, I_{R\_Shoulder\_Pos,i} \}. \qquad (11)$$

Each observation is used to provide proposal candidates for the image positions
of the head ($u_{Head}$), left shoulder ($u_{L\_Shoulder}$), right shoulder ($u_{R\_Shoulder}$), and
neck ($u_{Neck}$). The proposal distributions are modeled as Gaussian distributions
given by:

$$\begin{aligned}
q(u_{Head} \mid I_{Head\_Shoulder,i}) &\propto w_{HS,i}\, N(u_{Head} - I_{Head\_Pos,i}, \cdot, \cdot) \\
q(u_{Neck} \mid I_{Head\_Shoulder,i}) &\propto w_{HS,i}\, N(u_{Neck} - I_{Neck\_Pos,i}, \cdot, \cdot) \\
q(u_{L\_Shoulder} \mid I_{Head\_Shoulder,i}) &\propto w_{HS,i}\, N(u_{L\_Shoulder} - I_{L\_Shoulder\_Pos,i}, \cdot, \cdot) \\
q(u_{R\_Shoulder} \mid I_{Head\_Shoulder,i}) &\propto w_{HS,i}\, N(u_{R\_Shoulder} - I_{R\_Shoulder\_Pos,i}, \cdot, \cdot)
\end{aligned} \qquad (12)$$

The approach to combine all these observations is described in Section 5.4. 

Fig. 2. Image observations, from left: (i) original image, (ii) detected face and
head-shoulders contour, (iii) skin color ellipses extraction.



5.3 Elliptical Skin Blob Detection 

Skin color features provide important cues on arm positions. Skin blobs are
detected in four sub-stages: (i) color based image segmentation is applied to divide
the image into smaller regions, (ii) the probability of skin for each segmented 
region is computed using a histogram-based skin color Bayesian classifier, (iii) 
ellipses are fitted to the boundaries of these regions to form skin ellipse candi- 
dates, and (iv) adjacent regions with high skin probabilities are merged to form 
larger regions (see Figure 2). 

The extracted skin ellipses are used for inferring the positions of limbs. The
interpretation of a skin ellipse, however, depends on the clothing type. For
example, if the person is wearing short sleeves, then the skin ellipse represents the
lower arm, indicating the hand and elbow positions. For long sleeves, however, the
skin ellipse should cover only the hand and is used for inferring the hand position
only. Therefore the extracted skin ellipses provide different sets of interpretations
depending on the hypothesis on the clothing type in the current Markov chain
state.

For clarity in the following description, we assume that the clothing type
is short sleeve. For each skin ellipse, we extract the two extreme points of the
ellipse along the major axis. These points are considered as plausible candidates
for the hand-elbow pair, or elbow-hand pair, of either the left or right arm. Each
candidate is weighted by (i) the skin color probability of the ellipse, (ii) the likelihood
of the arm length, and (iii) the joint probability of the elbow and hand positions with one
of the shoulder candidates (for each ellipse, we find the shoulder candidate
that provides the highest joint probability).
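
The candidate generation for the short-sleeve case can be illustrated as follows. This is a sketch under assumptions: the ellipse parameterization (center, semi-major axis length, orientation angle) and the names `ellipse_endpoints` and `arm_candidates` are hypothetical, and the weighting of candidates is left out.

```python
import numpy as np

def ellipse_endpoints(center, semi_major, angle):
    """Return the two extreme points of an ellipse along its major axis.

    center: (x, y); semi_major: half-length of the major axis in pixels;
    angle: orientation of the major axis in radians.
    """
    c = np.asarray(center, dtype=float)
    direction = np.array([np.cos(angle), np.sin(angle)])
    return c + semi_major * direction, c - semi_major * direction

def arm_candidates(center, semi_major, angle):
    """Enumerate hand/elbow interpretations of one skin ellipse.

    Each endpoint may be the hand or the elbow, for either arm,
    which gives four unweighted candidates per ellipse.
    """
    p, q = ellipse_endpoints(center, semi_major, angle)
    return [
        {"arm": arm, "hand": hand, "elbow": elbow}
        for arm in ("left", "right")
        for hand, elbow in ((p, q), (q, p))
    ]
```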

5.4 Proposal Maps

In this section we present the new concept of proposal maps. Proposal maps are
generated from image observations to represent the proposal distributions of the
image positions of body joints. For this discussion, we focus on the generation of
a proposal map for the left hand. Using the skin ellipse cues presented earlier, we
generate a set of hypotheses on the left hand position, $\{I_{L\_Hand,i},\ i = 1, \ldots, N_h\}$,
where $N_h$ is the number of hypotheses. Each hypothesis has an associated weight
$w_{L\_Hand,i}$ and a covariance matrix $\Sigma_{L\_Hand,i}$ representing the measurement
uncertainty. From each hypothesis, the proposal distribution for the left hand image
position is given by:

$$q(u_{L\_Hand} \mid I_{L\_Hand,i}) \propto w_{L\_Hand,i}\, N(u_{L\_Hand}, I_{L\_Hand,i}, \Sigma_{L\_Hand,i}). \qquad (13)$$

Contributions of all the hypotheses are combined as follows:

$$q(u_{L\_Hand} \mid \{I_{L\_Hand,i}\}) \propto \max_i\, q(u_{L\_Hand} \mid I_{L\_Hand,i}). \qquad (14)$$

As the hypotheses are, in general, not independent, we use the max function 
instead of the summation in Equation (14); otherwise peaks in the proposal dis- 
tribution would be overly exaggerated. This proposal distribution is unchanged 
throughout the MCMC process. To improve efficiency, we approximate the dis- 
tribution as a discrete space with samples corresponding to every pixel position. 
This same approach is used to combine multiple observations for other body 
joints. Figure 3 shows the pseudo-color representation of the proposal maps for 
various body joints. Notice that the proposal maps have multiple modes, espe- 
cially for the arms, due to ambiguous observations and image clutter. 
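
A discrete proposal map of this kind can be sketched as below. The code is an illustration only, not the authors' implementation: the function names `proposal_map` and `sample_position`, the (x, y) pixel convention, and the final normalization to a probability table are assumptions.

```python
import numpy as np

def proposal_map(shape, means, covs, weights):
    """Build a per-pixel proposal map from weighted Gaussian hypotheses.

    shape: (height, width) of the image.
    means: (n, 2) hypothesized (x, y) joint positions.
    covs: (n, 2, 2) covariance matrices (measurement uncertainty).
    weights: (n,) hypothesis weights.
    Each pixel takes the maximum (not the sum) over the weighted
    Gaussians; the map is then normalized to sum to one.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys], axis=-1).astype(float)  # (h, w, 2) as (x, y)
    best = np.zeros((h, w))
    for mu, cov, wt in zip(means, covs, weights):
        diff = pix - mu
        inv = np.linalg.inv(cov)
        maha = np.einsum("...i,ij,...j->...", diff, inv, diff)
        norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
        best = np.maximum(best, wt * norm * np.exp(-0.5 * maha))
    return best / best.sum()

def sample_position(pmap, rng):
    """Draw one (x, y) pixel position from the proposal map."""
    idx = rng.choice(pmap.size, p=pmap.ravel())
    y, x = np.unravel_index(idx, pmap.shape)
    return int(x), int(y)
```

Because the map is fixed for the whole MCMC run, it only needs to be computed once per joint and can then be sampled cheaply at every jump.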



6 Image Likelihood Measure 

The image likelihood $p(I \mid m)$ consists of two components: (i) a region likelihood,
and (ii) a color likelihood. We have opted for an adaptation of the image
likelihood measure introduced in [14].



Region Likelihood. Color segmentation is performed to divide an input image
into a number of regions. Given a state variable m, we can compute the
corresponding human blob in the image. Ideally, the human blob should match the
union of a certain subset of the segmented regions.

We denote by $\{R_i,\ i = 1, \ldots, N_{region}\}$ the set of segmented regions, where $N_{region}$ is
the number of segmented regions, and by $H_m$ the human blob predicted from the
state variable m. For the correct pose, each region $R_i$ should belong either to
the human blob $H_m$ or to the background blob $\bar{H}_m$. In each segmented region
$R_i$, we count the number of pixels that belong to $H_m$ and to $\bar{H}_m$:

$$\begin{aligned}
N_{i,human} &= \text{count of pixels } (u,v) \text{ where } (u,v) \in R_i \text{ and } (u,v) \in H_m, \\
N_{i,background} &= \text{count of pixels } (u,v) \text{ where } (u,v) \in R_i \text{ and } (u,v) \in \bar{H}_m.
\end{aligned} \qquad (15)$$

Fig. 3. Proposal maps for various body joints. The proposal probability of each pixel
is illustrated in pseudo-color (or grey level in monochrome version). Panels: original
image, head, neck, left shoulder, right shoulder, left elbow, right elbow, left wrist,
right wrist.



We define a binary label $l_i$ for each region and classify the region so that

$$l_i = \begin{cases} 1 & \text{if } N_{i,human} \geq N_{i,background} \\ 0 & \text{otherwise} \end{cases} \qquad (16)$$

We then count the number of incoherent pixels, $N_{incoherent}$, given as:

$$N_{incoherent} = \sum_{i=1}^{N_{region}} (N_{i,background})^{l_i}\, (N_{i,human})^{1 - l_i} \qquad (17)$$

The region-based likelihood measurement is then defined by:

$$L_{region} = \exp(-\lambda_{region} N_{incoherent}) \qquad (18)$$

where $\lambda_{region}$ is a constant determined empirically using a Poisson model.
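
The region likelihood can be sketched as follows, assuming the segmentation is given as an integer label image and the predicted human blob as a boolean mask. The function name `region_likelihood` and these input formats are assumptions made for illustration.

```python
import numpy as np

def region_likelihood(segments, human_mask, lam_region):
    """Compute the region likelihood and the incoherent pixel count.

    segments: (h, w) integer array of segmented-region labels.
    human_mask: (h, w) boolean array, True inside the predicted human blob.
    lam_region: positive constant controlling the penalty per pixel.
    """
    n_incoherent = 0
    for label in np.unique(segments):
        in_region = segments == label
        n_human = int(np.count_nonzero(in_region & human_mask))
        n_background = int(np.count_nonzero(in_region & ~human_mask))
        # Label the region as human when most of its pixels are human.
        is_human = n_human >= n_background
        # Pixels that disagree with the region's label are incoherent.
        n_incoherent += n_background if is_human else n_human
    return float(np.exp(-lam_region * n_incoherent)), n_incoherent
```

A pose whose predicted blob follows the segment boundaries leaves few incoherent pixels and therefore scores close to 1.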



Color Likelihood. The likelihood measure expresses the difference between the
color distributions of the human blob $H_m$ and the background blob $\bar{H}_m$. Given
the predicted blobs $H_m$ and $\bar{H}_m$, we compute the corresponding color
distributions, denoted by d and b. The color distributions are expressed by normalized
histograms with $N_{histogram}$ bins. The color likelihood is then defined by:

$$L_{color} = \exp(-\lambda_{color} B_{d,b}) \qquad (19)$$

where $\lambda_{color}$ is a constant and $B_{d,b}$ is the Bhattacharyya coefficient measuring
the similarity of two color distributions and defined by:

$$B_{d,b} = \sum_{i=1}^{N_{histogram}} \sqrt{d_i b_i}. \qquad (20)$$

The combined likelihood measure is given by:

$$L = L_{region} \times L_{color}. \qquad (21)$$
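
The color term and the combined measure can be illustrated with the short sketch below. It assumes RGB pixels quantized into a joint 3D histogram and the exponential form exp(-lambda * B) for the color likelihood; the bin count, the function names and the histogram layout are assumptions rather than details given in the text.

```python
import numpy as np

def normalized_histogram(pixels, bins=8):
    """Return a flattened, normalized RGB histogram.

    pixels: (n, 3) array of RGB values in [0, 255].
    """
    sample = np.asarray(pixels, dtype=float)
    hist, _ = np.histogramdd(sample, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def bhattacharyya(d, b):
    """Bhattacharyya coefficient of two normalized histograms (1 = identical)."""
    return float(np.sum(np.sqrt(np.asarray(d) * np.asarray(b))))

def color_likelihood(d, b, lam_color):
    """High when the human and background color distributions differ."""
    return float(np.exp(-lam_color * bhattacharyya(d, b)))

def combined_likelihood(l_region, l_color):
    """Product of the region and color likelihood terms."""
    return l_region * l_color
```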



7 MCMC and Proposal Distribution 



We adapted the data-driven MCMC framework [13], which allows the use of
image observations for designing proposal distributions that find regions of high
density efficiently. At the t-th iteration in the Markov chain process, a candidate
m' is sampled from $q(m' \mid m_{t-1})$ and accepted with probability

$$p = \min\left(1,\ \frac{p(m' \mid I)\, q(m_{t-1} \mid m')}{p(m_{t-1} \mid I)\, q(m' \mid m_{t-1})}\right). \qquad (22)$$
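
The acceptance rule is the standard Metropolis-Hastings test. A generic sketch in log space is given below; the callables `log_posterior`, `propose` and `log_q` are placeholders that a concrete system would have to supply, and nothing here is specific to the pose model.

```python
import math
import random

def mh_step(current, log_posterior, propose, log_q, rng=random):
    """Perform one Metropolis-Hastings step.

    log_posterior(m): log of the (unnormalized) posterior p(m | I).
    propose(m): draws a candidate m' given the current state m.
    log_q(a, b): log proposal density of a given b, i.e. log q(a | b).
    Returns (new_state, accepted).
    """
    candidate = propose(current)
    log_ratio = (
        log_posterior(candidate) + log_q(current, candidate)
        - log_posterior(current) - log_q(candidate, current)
    )
    # Working in log space avoids overflow; min(0, .) implements min(1, ratio).
    accept_prob = math.exp(min(0.0, log_ratio))
    if rng.random() < accept_prob:
        return candidate, True
    return current, False
```

Only ratios of the posterior appear, so the normalizing constant of p(m | I) never has to be computed.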



The proposal process is executed by three types of Markov chain dynamics de- 
scribed in the following. 



Diffusion Dynamic. This process serves as a local optimizer and the proposal 
distribution is given by: 

$$q(m' \mid m_{t-1}) \propto N(m', m_{t-1}, \Sigma_{diffusion}) \qquad (23)$$

where the variance $\Sigma_{diffusion}$ is set to reflect the local variance of the posterior
distribution, estimated from training data.



Proposal Jump Dynamic. This jump dynamic allows exploratory search 
across different regions of the solution space using proposal maps derived from 
observation. In each jump, only a subset of the proposal maps is used. For this 
discussion, we focus on observations of the left hand. To perform a jump, we 
sample a candidate of the hand position from the proposal map: 



$$u_{L\_hand} \sim q(u_{L\_hand} \mid \{I_{L\_hand,i}\}). \qquad (24)$$

The sampled hand image position is then used to compute, via inverse kinematics
(IK), a new state variable m' that satisfies the following condition:

$$f_j(m') = \begin{cases} f_j(m_{t-1}) & \text{where } j \neq L\_hand \\ u_{L\_hand} & \text{where } j = L\_hand \end{cases} \qquad (25)$$

where $f_j(m_{t-1})$ is the deterministic function that generates the image position of a
body joint, given the state variable. In other words, IK is performed by keeping
the other joint positions constant and modifying the pose parameters to adjust the
image position of the left hand. When there are multiple solutions due to depth
ambiguity, we choose the solution that has the minimum change in depth. If m'
cannot be computed (e.g., because it would violate the geometric constraints), then
the proposed candidate is rejected.

Flip Dynamic. This dynamic involves flipping a body part (i.e. head, hand,
lower arm or entire arm) along the depth direction, around its pivotal joint [10]. The flip
dynamic is balanced so that forward and backward flips have the same proposal
probability. The solution candidate m' is computed by inverse kinematics.
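
The geometric idea behind a depth flip can be illustrated on 3D joint positions. The sketch below assumes an orthographic projection along the z axis and represents joints as 3D points; it only mirrors positions and does not perform the inverse kinematics step that recovers joint angles. The function name `flip_in_depth` and the dictionary layout are hypothetical.

```python
import numpy as np

def flip_in_depth(joints, pivot, moved):
    """Mirror selected joints through the depth plane of a pivot joint.

    joints: dict mapping joint name to np.array([x, y, z]), z being depth.
    pivot: name of the pivotal joint (for example the elbow).
    moved: names of the joints belonging to the flipped body part.
    Image positions (x, y) are unchanged and distances between the
    pivot and the moved joints are preserved.
    """
    flipped = {name: np.array(pos, dtype=float) for name, pos in joints.items()}
    z0 = flipped[pivot][2]
    for name in moved:
        flipped[name][2] = 2.0 * z0 - flipped[name][2]
    return flipped
```

Applying the same flip twice restores the original configuration, which is consistent with forward and backward flips being proposed symmetrically.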

8 Experimental Results 

We used images of indoor meeting scenes as well as outdoor images for testing.
Ground truth is generated by manually locating the positions of various body 
joints on the images and estimating the relative depths of these joints. This data 
set is available at http://www-scf.usc.edu/~munlee/PoseEstimation.html. 



8.1 Pose Estimation 

Figure 4 shows the obtained results on various images. These images were not 
among the training data. The estimated human model and its pose (solutions 
with the highest posterior probability) are projected onto the original image and 
a 3D rendering from a sideward view is also shown. 

The estimated joint positions were compared with the ground truth data,
and an RMS error was computed. Since the depth had higher uncertainties, we
computed two separate measurements, one for the 2D positions, and the other 
for the depth. The histograms of these errors (18 images processed) are shown 
in Figure 5a. This set of images and the pose estimation results are available at 
the webpage: http://www-scf.usc.edu/~munlee/images/upperPoseResult.htm. 
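
The two error measures can be computed as in the sketch below. The exact definition used for the reported numbers is not given in the text, so the per-joint averaging, the array layout (x, y, depth) and the function name `rms_errors` are assumptions.

```python
import numpy as np

def rms_errors(estimated, truth):
    """Return (rms_2d, rms_depth) over all joints.

    estimated, truth: (n_joints, 3) arrays with columns (x, y, depth).
    rms_2d uses the Euclidean image-plane distance per joint;
    rms_depth uses the depth difference per joint.
    """
    diff = np.asarray(estimated, dtype=float) - np.asarray(truth, dtype=float)
    rms_2d = float(np.sqrt(np.mean(np.sum(diff[:, :2] ** 2, axis=1))))
    rms_depth = float(np.sqrt(np.mean(diff[:, 2] ** 2)))
    return rms_2d, rms_depth
```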



8.2 Convergence Analysis 

Figure 5b shows the RMS errors (averaged over test images) with respect to 
the MCMC iterations. As the figure shows, the error for the 2D image position 
decreases rapidly from the start of the MCMC process and this is largely due to 
the observation-driven proposal dynamics. For the depth estimate, the kinema- 
tics flip dynamic was helpful in finding hypotheses with good depth estimates. 
It however required a longer time for exploration. The convergence time varies 
considerably among different images, depending on the quality of the image ob- 
servations. For example, if there were many false observations, the convergence 
required a longer time. On average, 1000 iterations took about 5 minutes. 

9 Conclusion 

We have presented an MCMC framework for estimating 3D human upper body
pose in static images. This hypothesis-and-test framework uses a generative
model with domain knowledge such as the human articulated structure and allows
us to formulate appropriate prior distributions and likelihood functions for
evaluating samples in the solution space.




Fig. 4. Pose Estimation. First row: original images, second row: estimated poses,
third row: estimated poses (side view).

Fig. 5. Results: (a) Histogram of RMS Error (vertical axis: occurrence) (b) Convergence Analysis.

In addition, the concern with high dimensionality and efficiency suggests
that the search process should be driven more by image observations. The
data-driven MCMC framework offers flexibility in designing proposal
mechanisms for sampling the solution space. Our technique incorporates multiple cues
to provide robustness.

We introduce the use of proposal maps, which are an efficient way of
consolidating information provided by observations and implementing proposal
distributions. Qualitative and quantitative results are presented to show that the
technique is effective over a wide variety of images. In future work, we will extend
our work to full body pose estimation and video-based tracking.

Abstract. Optic disc detection is important in the computer-aided
analysis of retinal images. It is crucial for the precise identification of the
macula to enable successful grading of macular pathology such as diabetic
maculopathy. However, the extreme variation of intensity features within
the optic disc and intensity variations close to the optic disc boundary
present a major obstacle in automated optic disc detection. The
presence of blood vessels, crescents and peripapillary chorioretinal atrophy
seen in myopic patients also increases the complexity of detection.
Existing techniques have not addressed these difficult cases, and are neither
adaptable nor sufficiently sensitive and specific for real-life application.
This work presents a novel algorithm to detect the optic disc based on
wavelet processing and ellipse fitting. We first employ the Daubechies
wavelet transform to approximate the optic disc region. Next, an abstract
representation of the optic disc is obtained using an intensity-based
template. This yields robust results in cases where the optic disc intensity is
highly non-homogeneous. An ellipse fitting algorithm is then utilized to
detect the optic disc contour from this abstract representation. Additional
wavelet processing is performed on the more complex cases to improve
the contour detection rate. Experiments on 279 consecutive retinal
images of diabetic patients indicate that this approach is able to achieve an
accuracy of 94% for optic disc detection.



1 Introduction 

Digital retinal images are widely used in the diagnosis and follow-up management 
of patients with eye disorders such as glaucoma, diabetic retinopathy, and age- 
related macular degeneration. Glaucoma is the second leading cause of blindness 
in the world, affecting some 67 to 105 million patients [20]. In glaucoma, an 
abnormally raised intraocular pressure damages the optic nerve and results in 
morphological changes in the optic disc. This leads to an increase in the size of 
the optic cup. Diabetic retinopathy is also a leading cause of blindness and visual
impairment in many developed countries and accounts for 12,000 to 24,000 cases
of blindness in the United States alone every year [5].



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3022, pp. 139-151, 2004.

The automated detection of the optic disc has several potential clinical uses.
First, the vertical diameters of the optic cup and disc may aid the diagnosis 
of glaucoma [3]. Changes in these parameters of the optic disc in serial images 
may indicate progression of the disease. Second, it allows the identification of 
the macula using the spatial relationship between the optic disc and macula. 
The macula is located on the temporal aspect of the optic disc and is situated at
a distance of about 2.5 disc diameters from the centre of the optic disc [9].
Lesions in the macular region that result from diabetic retinopathy
and age-related macular degeneration are often sight-threatening. Identifying
the macula allows highly sensitive algorithms to be designed to detect signs of
abnormality in the macular region.

The optic disc appears as an elliptical region with high intensity in retinal 
images (see Fig. 1). The vertical and horizontal diameters of an optic disc are
typically 1.82 ± 0.15 mm and 1.74 ± 0.21 mm respectively [3]. Clinically, optic disc
measurements can be obtained by approximating the disc to an ellipse [2].




Fig. 1 . (a) Outline of optic disc (white ellipse), (b) Outline of optic disc with peripa- 
pillary chorioretinal atrophy (black arrows) 



While existing algorithms [8,10,13,14,15,16,18] employ a variety of techniques
to detect the optic disc, they are neither sufficiently sensitive nor sufficiently
specific for clinical application. The main obstacle is the extreme variation of the optic disc
intensity features and the presence of retinal blood vessels (Fig. 1(a)).
Peripapillary chorioretinal atrophy, which is commonly seen in myopic eyes, also increases
the complexity of optic disc detection. It presents as a bright crescent-shaped
area adjacent to the optic disc, usually on its temporal side (Fig. 1(b)), or as a
bright annular (doughnut-shaped) area surrounding the optic disc.

Our proposed approach overcomes the above challenges as follows. We first
approximate the optic disc boundary via the use of the Daubechies wavelet transform
and intensity-based techniques. Next, an ellipse fitting algorithm is employed to
detect the optic disc contour in the optic disc boundary region. Experiments
on 279 consecutive retinal images showed that we were able to achieve an
accuracy of 94% for optic disc detection and 93% accuracy based on mean vertical
diameter assessment.

2 Related Work

There has been a long stream of research to automate optic disc detection.
Techniques such as active contour models [10,16], template matching [8],
pyramidal decomposition [8], variance image calculation [18] and clustering
techniques [13] have been developed. Among them, active contour-based models have
been shown to give better results than the other techniques. We
evaluate active contour models on optic discs ranging from those with the least
contour variation to those with complex variations, and discuss their results and
limitations here.

Snakes or active contours [7,11,17] are curves defined within an image domain 
that can move under the influence of internal forces coming from the curve 
itself and external forces computed from the image data. There are two types 
of active contour models: parametric active contours [22] and geometric active 
contours [23]. Parametric active contours synthesize parametric curves within 
image domains and allow them to move towards desired features, usually edges. 
A traditional snake is a curve $\mathbf{x}(s) = [x(s), y(s)]$, $s \in [0, 1]$, that moves through
the spatial domain of an image to minimize the energy functional

$$E = \int_0^1 \frac{1}{2}\left[\alpha |\mathbf{x}'(s)|^2 + \beta |\mathbf{x}''(s)|^2\right] + \gamma E_{ext}(\mathbf{x}(s))\, ds \qquad (1)$$

where $\alpha$, $\beta$, $\gamma$ are weighting parameters that control the snake’s tension, rigidity
and influence of external force, respectively, and $\mathbf{x}'(s)$ and $\mathbf{x}''(s)$ denote the first
and second derivatives of $\mathbf{x}(s)$ with respect to s. The external energy function
$E_{ext}$ is derived from the image so that it takes on its smaller values at the features
of interest, such as boundaries.
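
For a sampled contour, the energy functional can be approximated with finite differences. The sketch below assumes a closed contour sampled uniformly in s and an external energy supplied as a callable; the function name `snake_energy` and these conventions are assumptions for illustration, not part of any cited method.

```python
import numpy as np

def snake_energy(points, ext_energy, alpha, beta, gamma):
    """Approximate the snake energy of a closed, uniformly sampled contour.

    points: (n, 2) array of contour samples x(s_k), s_k = k / n.
    ext_energy: callable mapping an (n, 2) array to (n,) external energies.
    alpha, beta, gamma: tension, rigidity and external-force weights.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ds = 1.0 / n
    nxt = np.roll(pts, -1, axis=0)
    prv = np.roll(pts, 1, axis=0)
    d1 = (nxt - pts) / ds                    # forward difference, x'(s)
    d2 = (nxt - 2.0 * pts + prv) / ds**2     # central difference, x''(s)
    internal = 0.5 * (alpha * np.sum(d1**2, axis=1)
                      + beta * np.sum(d2**2, axis=1))
    external = gamma * np.asarray(ext_energy(pts), dtype=float)
    return float(np.sum((internal + external) * ds))
```

With an external energy taken from image intensity, for instance, lower values of this quantity correspond to smoother contours lying on darker or brighter pixels, depending on the sign convention chosen for the external term.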

Analysis of the digital retinal images reveals that the use of the gradient 
image to derive the external energy function needed by the active contours model 
is not suitable because the gradient image contains too much noise arising from 
the retinal vessels. Even when the retinal vessels are removed [6] from the gradient
image, the removal may be incomplete, and the removal process may itself
distort the original optic disc contour. Another option is to use an intensity-based
model of the external force $E_{ext}$. Here, we use the gray value of the green layer
of the original image as the external force. Table 1 shows the results for various 
weight parameters. Note that the intensity-based external force model tends to 
produce poorer results. Further attempts to improve results using morphological 
operators have not been successful due to wide variations in optic disc features. 

D.T. Morris et al. [16] reported the use of active contour models to detect the
optic disc with a preprocessing step to overcome these problems. Images are first
preprocessed using histogram equalization. This is followed by the use of a pyramid
edge detector. While this approach shows improved results, it suffers from two
drawbacks. First, the preprocessing steps may cause the optic disc boundary to
become intractable because it fuses with the surrounding high intensity regions.
Second, the pyramid edge detector is unable to filter noise from vessel edges
adequately, so the active contour model fails to outline the optic disc boundary
correctly.
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Similarly, while region snakes works well for optic discs with uniform intensity 
distribution, it tends to fail in optic discs having very low intensity or in cases 
where segment of optic disc has very low intensity. Application of deformable 
super-quadrics, or dynamic models with global and local deformation properties 
inherited from super-quadric ellipsoids and membrane splines, may be useful in 
optic disc detection. However, it will fail in cases with peripapillary atrophy, 
where there is a high intensity region next to the optic disc. Further, its high 
computational cost is not suitable for online processing of digital retinal images. 
These limitations motivated us to develop a robust yet efficient technique to 
reliably locate and outline the optic disc. 



Table 1. Results for active contour models




3 Optic Disc Localization and Contour Detection 

The major steps in the proposed approach to reliably detect the optic disc in 
large numbers of retinal images under diverse conditions are as follows. First, 
the approximate location of the optic disc is estimated via wavelet transform.
The intensity template is employed to construct an abstract representation of
the optic disc. This abstract representation of the optic disc significantly reduces 
the processing area, thus increasing the computational efficiency. Next, an ellipse 
fitting procedure is applied to detect disc contour and to filter out difficult cases. 
Finally, a wavelet-based high pass filter is used to remove undesirable edge noise 
and to enhance the detection of non-homogenous optic discs. 

Our image database consists of digital retinal images captured using a
Topcon© fundus camera. All images are standard 40-degree fields of the
retina centered on the macula. Image resolution is 25 microns/pixel. Images were
stored in 24-bit TIFF format with an image size of 768 x 576 pixels.

3.1 Localization of Optic Disc Window by Daubechies Wavelet 
Transformation 

Figure 2 shows the different color layers of a typical retinal image. It is evident 
that the optic disc outline is not present in the red layer (Fig. 2(b)) or the blue 
layer (Fig. 2(c)). In contrast, the green layer (Fig. 2(d)) captures the optic disc 
outline. We use this layer for subsequent processing. 




Fig. 2. Color layers of a retinal image: (a) original image; (b) red layer;
(c) blue layer; (d) green layer

Fig. 3. Selected optic disc region using Daubechies wavelet transform



There has been growing interest in using wavelets as a transform technique
for image processing. The aim of the wavelet transform is to ‘express’ an input
signal as a series of coefficients of specified energy. We use the Daubechies
wavelet [12] to localize the optic disc. First, a wavelet transform is carried out to
obtain the wavelet coefficients. Next, an inverse wavelet transform is performed
after thresholding the HH component (high pass in the vertical and horizontal
directions) (Fig. 3(b)). The resultant image is then subtracted from the original
retinal image to obtain the subtracted image, and its sub-images (16 x 16 pixels)
are analyzed (Fig. 3(c)).

Note that the sub-image with the highest mean value corresponds to the area
inside the optic disc. Hence, the center (Xc, Yc) of the sub-image with the
highest mean intensity is selected and the optic disc region is defined as a W x W
window centered at (Xc, Yc). The dimension W is determined by taking into
consideration the image resolution (25 microns/pixel) and the average size of the
optic disc in the general population. Based on the results, W is set to 180.
Fig. 3(d) shows the selected optic disc window. Experiments on 279 digital retinal
images show 100% accuracy in the detection of the optic disc.
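
The window selection step described above can be sketched as follows. This is a minimal illustration only: it assumes the subtracted image is already available as a list of rows, that the 16 x 16 sub-images tile the image without overlap, and that the window is clipped at the image border. The function name `locate_disc_window` and these choices are hypothetical, since the text does not specify them.

```python
def locate_disc_window(image, block=16, window=180):
    """Pick the brightest block of a subtracted image and return a window around it.

    image : list of rows of pixel values (assumed to be the subtracted image)
    Returns ((xc, yc), (x0, y0, x1, y1)) with exclusive upper bounds x1, y1.
    """
    rows, cols = len(image), len(image[0])
    best_mean, best_center = None, None
    # Scan non-overlapping block x block sub-images (tiling is an assumption).
    for top in range(0, rows - block + 1, block):
        for left in range(0, cols - block + 1, block):
            total = sum(sum(image[r][left:left + block])
                        for r in range(top, top + block))
            mean = total / (block * block)
            if best_mean is None or mean > best_mean:
                best_mean = mean
                best_center = (left + block // 2, top + block // 2)
    xc, yc = best_center
    half = window // 2
    # Clip the W x W window to the image (border handling is an assumption).
    x0, y0 = max(0, xc - half), max(0, yc - half)
    x1, y1 = min(cols, xc + half), min(rows, yc + half)
    return (xc, yc), (x0, y0, x1, y1)
```

At 25 microns/pixel, the default of 180 pixels corresponds to a window of 4.5 mm on a side.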



3.2 Abstract Representation of Optic Disc Boundary Region 

We have shown that there exist wide variations in the optic disc boundary, from
clear boundary outlines to very difficult cases with complex boundary outlines.
To minimize the interference from these complications, we use an abstract
representation of the optic disc to capture the optic disc boundary. This has been
shown to give robust results, including in cases with highly non-homogenous optic discs.




Fig. 4. Template to localize optic disc boundary 



The abstract representation of the optic disc takes the form of a template,
shown in Fig. 4. It consists of two circles: an inner circle $C_i$ and an outer circle
$C_o$. $C_i$ denotes the approximated optic disc boundary and the region between
$C_i$ and $C_o$ is the immediate background. $C_o$ and $C_i$ are concentric
circles, and the diameter ($d_o$) of $C_o$ is defined in terms of the diameter ($d_i$) of $C_i$ as

$$d_o = d_i + K \qquad (2)$$

The optimal K value is obtained by using a training image set. The optic
disc is approximated to the template by calculating the intensity ratio (IR) as
follows:

$$IR = M_i / M_o \qquad (3)$$

where $M_i$ is the mean intensity of pixels inside the circle $C_i$ and $M_o$ is the mean
intensity of the region between circles $C_i$ and $C_o$. Vessel pixels are excluded
from the calculation of mean intensity to increase the accuracy. The abstract
representation of the optic disc is obtained by searching for the best fitting inner
circle $C_i$. Fig. 5 and Fig. 6 show the abstract representations obtained.

The optic disc boundary region is selected as the region between $d_i \pm K$
(Fig. 7). By processing the optic disc at an abstract level rather than pixel level,
we are able to detect the optic disc boundary region accurately in cases where
the optic disc is highly non-homogenous.
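
The template fit of Eqs. (2) and (3) can be illustrated with the minimal sketch below. It assumes that the "best fitting" inner circle is the one that maximizes IR over a set of candidate centers and diameters; the text does not spell out the search criterion, so this, together with the names `intensity_ratio` and `best_inner_circle`, is an assumption.

```python
def intensity_ratio(image, cx, cy, d_i, k, vessel_mask=None):
    """Eq. (3): mean intensity inside C_i divided by mean intensity of the ring
    between C_i and C_o, where C_o has diameter d_o = d_i + K (Eq. (2))."""
    r_i = d_i / 2.0
    r_o = (d_i + k) / 2.0
    inner_sum = inner_n = ring_sum = ring_n = 0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if vessel_mask is not None and vessel_mask[y][x]:
                continue  # vessel pixels are excluded from both means
            dist2 = (x - cx) ** 2 + (y - cy) ** 2
            if dist2 <= r_i ** 2:
                inner_sum += value
                inner_n += 1
            elif dist2 <= r_o ** 2:
                ring_sum += value
                ring_n += 1
    if inner_n == 0 or ring_n == 0 or ring_sum == 0:
        return 0.0  # degenerate template; treated as the worst possible fit
    return (inner_sum / inner_n) / (ring_sum / ring_n)


def best_inner_circle(image, centers, diameters, k, vessel_mask=None):
    """Exhaustive search; returns (IR, cx, cy, d_i) with the largest IR
    (maximizing IR is an assumption about the fitting criterion)."""
    return max((intensity_ratio(image, cx, cy, d, k, vessel_mask), cx, cy, d)
               for cx, cy in centers for d in diameters)
```

On a bright disc against a darker background, a circle that is too small places bright pixels in the ring, and a circle that is too large places dark pixels inside, so both lower the ratio.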







Fig. 5. (a), (c) Uniform optic disc images; (b), (d) Fitting of template 





Fig. 6. (a), (c) Non-uniform optic disc images; (b), (d) Fitting of template 



3.3 Ellipse Fitting to Detect Optic Disc Contour 

One of the basic tasks in pattern recognition and computer vision is the fitting of
geometric primitives to a set of points. Existing ellipse fitting algorithms exploit
methods such as Hough transforms [1], Kalman filtering, fuzzy clustering, or the least
squares approach [4]. These can be divided into (1) clustering and (2) optimization
based methods.

The first group of fitting techniques includes Hough transform and fuzzy
clustering, which are robust against outliers and can detect multiple primitives
simultaneously. Unfortunately, these techniques have low accuracy, are slow, and
require a large amount of memory. The second group of fitting methods, which
includes the least squares approach [4], is based on the optimization of an
objective function that characterizes the goodness of a particular ellipse with respect
to the given set of data points. The main advantages of this group of methods
are their speed and accuracy. However, these methods can fit only one primitive
at a time, that is, the data should be pre-segmented before the fitting. Further,
they are more sensitive to outliers than clustering methods.

Fig. 7. (a), (c) Optic disc regions; (b), (d) Isolated optic disc boundary region





Fig. 8. (a), (c) Sobel edge maps; (b), (d) After vessel removal 



In our proposed ellipse fitting algorithm, a Sobel edge map of the optic disc
boundary region is used (Fig. 8(a) and (c)). These Sobel images tend to have a
high degree of noise arising from blood vessel edges, and the edges break at a number of
places. Hence, we first remove all the vessel information by using a retinal vessel
detection algorithm [6] (Fig. 8(b) and (d)). The ellipse fitting algorithm is then used
to detect the optic disc contour from the resultant images.

Our ellipse fitting algorithm finds the four best fitting ellipses with minimal
errors. The ellipse center is moved within the area defined by the inner circle $C_i$.
The ellipse major axis a varies between W/2 ± W/4, while the minor axis b of
the ellipse is restricted to (1 ± 0.2) * a pixels. These conditions are set according
to the optic disc variations. The best fitting ellipses are given by



EFi = ^ * (a + b) (4) 

where $EF_i$ is the measure of ellipse fitting and $P_i$ is the number of edge points
for the ellipse i. The four ellipses with the highest $EF_i$ are selected and their intensity
ratios are calculated (see Eq. (3)). The ellipse with the
highest IR whose major and minor axes fall within (1 ± 0.25) $d_i$ is regarded
as the detected optic disc contour. Fig. 9 shows that the ellipse fitting procedure
is able to accurately detect the optic disc boundary.
Fig. 10 depicts a difficult case where the ellipse fitting model detects part of
the optic cup edge as the optic disc contour. Careful analysis reveals that this
is due to the presence of optic cup edge points, which tend to overshadow the
actual edge points of the optic disc boundary. In these situations, a wavelet-based
enhancement is initiated.
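
The two-stage selection rule described above (the four highest EF values first, then the highest IR among ellipses of admissible size) can be sketched as follows. The sketch takes the EF and IR values as given and does not compute them. The dictionary keys, the function name `select_disc_ellipse`, and the reading that both axis lengths are compared directly with $d_i$ are assumptions made for illustration.

```python
def select_disc_ellipse(candidates, d_i, top_n=4, tolerance=0.25):
    """Choose the optic disc contour among fitted ellipses.

    candidates : list of dicts with hypothetical keys
                 'ef' (fitting measure), 'ir' (intensity ratio),
                 'major' and 'minor' (axis lengths in pixels)
    d_i        : diameter of the inner template circle C_i
    Returns the selected candidate, or None if no ellipse is admissible.
    """
    # Stage 1: keep the top_n ellipses by fitting measure EF.
    best_by_fit = sorted(candidates, key=lambda c: c["ef"], reverse=True)[:top_n]
    # Stage 2: both axes must lie within (1 +/- tolerance) * d_i.
    low, high = (1.0 - tolerance) * d_i, (1.0 + tolerance) * d_i
    admissible = [c for c in best_by_fit
                  if low <= c["major"] <= high and low <= c["minor"] <= high]
    if not admissible:
        return None
    # Stage 3: the highest intensity ratio wins.
    return max(admissible, key=lambda c: c["ir"])
```

Returning `None` when no candidate passes the size check is one way to flag the difficult cases for further processing.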







Fig. 9. (a), (c) Four best ellipses superimposed on optic disc region; (b), (d) Correctly 
detected ellipse 







Fig. 10. (a) Manual outline of optic disc; (b) Optic disc region green layer; (c) Arrow
indicates optic cup edge interference; (d) Detected ellipses



3.4 Enhancement Using Daubechies Wavelet Transformation 

To overcome the problem of noise due to the presence of optic cup points, we
employ the Daubechies wavelet transform [12] to enhance the optic disc boundary.
This is achieved by performing the inverse wavelet transformation of the coefficients
after filtering out the HH component. This step gives rise to an image whose
optic cup region has been removed. Fig. 11(a) shows the edge map of such an
inverse-transformed image. Once the edge image has been obtained, we further
threshold it with the image mean. This successfully removes the very
prominent edge points due to the optic cup and gives prominence to the faint optic
disc boundary edges (Fig. 11(b)). Fig. 11(d) shows an accurately detected optic
disc boundary after wavelet processing.



Fig. 11. (a) Sobel edge image after wavelet enhancement; (b) Thresholding with image 
mean; (c) Ellipses selected by algorithm; (d) Best fitting ellipse 
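
The enhancement step of Section 3.4 can be illustrated with the simplified sketch below. It substitutes a single-level Haar transform for the Daubechies wavelet used in the text, purely to keep the example short and self-contained, and it omits the Sobel edge detection that sits between the two functions. The function names and the choice to zero pixels at or below the mean are assumptions.

```python
def remove_diagonal_detail(image):
    """Single-level 2D Haar transform with the HH band set to zero, followed
    by the inverse transform (Haar is a stand-in for Daubechies here).

    Works block by block on 2 x 2 pixel groups; a trailing odd row or column
    is copied unchanged.
    """
    rows, cols = len(image), len(image[0])
    out = [[float(v) for v in row] for row in image]
    for r in range(0, rows - 1, 2):
        for c in range(0, cols - 1, 2):
            a, b = image[r][c], image[r][c + 1]
            e, d = image[r + 1][c], image[r + 1][c + 1]
            hh = (a - b - e + d) / 2.0  # diagonal (HH) Haar coefficient
            # Inverse transform with HH = 0 removes hh / 2 from each pixel,
            # with the sign pattern of the diagonal basis function.
            out[r][c] = a - hh / 2.0
            out[r][c + 1] = b + hh / 2.0
            out[r + 1][c] = e + hh / 2.0
            out[r + 1][c + 1] = d - hh / 2.0
    return out


def threshold_with_mean(edge_image):
    """Keep edge responses above the image mean and zero the rest."""
    values = [v for row in edge_image for v in row]
    mean = sum(values) / len(values)
    return [[v if v > mean else 0 for v in row] for row in edge_image]
```

In this Haar version, purely horizontal or vertical intensity steps pass through unchanged, while diagonal, checkerboard-like detail is averaged away.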



4 Experimental Results 

We evaluated our proposed approach on 279 consecutive digital retinal images. 
The following performance criteria are used: 

(1) Accuracy - Ratio of the number of acceptably detected contours as as- 
sessed by a trained medical doctor over the total number of images. 

(2) Vertical Diameter Assessment - Average ratio of the vertical diameter of 
the detected contour over the vertical diameter of the actual optic disc boundary. 

For criterion (2), the optic disc boundary outline of each image was carefully
traced by a trained medical doctor and the entire optic disc area was set
to a gray value of 255 with the background set to 0. The actual vertical
diameter of the disc boundary is obtained from this transformed image.

Table 2 shows the results obtained. Without additional wavelet processing,
the optic disc detection algorithm achieved 86% accuracy and 87% vertical
diameter assessment. Using Daubechies wavelet processing to improve the difficult
cases, we are able to achieve an accuracy of 94% and a vertical diameter
assessment of 93%. This improvement of 8 percentage points in accuracy includes the most difficult
cases where the optic disc is of low intensity and is situated in a neighborhood
with high intensity variations.
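
The two performance criteria can be computed as in the following minimal sketch. It assumes that the vertical diameter of a binary mask is the number of rows spanned by its foreground pixels and that detected contours are available as filled masks of the same kind as the traced ground truth; the function names are hypothetical.

```python
def accuracy(num_acceptable, num_images):
    """Criterion (1): fraction of contours judged acceptable."""
    return num_acceptable / num_images


def vertical_diameter(mask):
    """Rows spanned by the foreground (non-zero) pixels of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(v > 0 for v in row)]
    return rows[-1] - rows[0] + 1 if rows else 0


def mean_diameter_ratio(detected_masks, truth_masks):
    """Criterion (2): average of detected / actual vertical diameter."""
    ratios = [vertical_diameter(d) / vertical_diameter(t)
              for d, t in zip(detected_masks, truth_masks)]
    return sum(ratios) / len(ratios)
```

The counts reported in Table 2 reproduce the stated percentages: 240/279 rounds to 86% and 262/279 rounds to 94%.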



Table 2. Detection of optic disc contour

Method                                       Accuracy         Vertical Diameter Assessment
Ellipse fitting without wavelet processing   86% (240/279)    87%
Ellipse fitting with wavelet processing      94% (262/279)    93%



5 Discussion

Existing optic disc detection algorithms focus mainly on optic disc localization
and detection of the optic disc boundary. Optic disc localization is important
as it reduces the computational cost. In [8], an optic disc localization algorithm
using pyramidal decomposition is proposed. Potential optic disc regions are located
using Haar wavelet-based pyramidal decomposition and are analyzed using Hausdorff
template matching to detect the probable optic disc. In [18], a localization
algorithm based on the variance of image intensity is designed. The variance of intensity of
adjacent pixels is used for recognition of the optic disc. The original retinal image
is subdivided into sub-images and their respective mean intensities are calculated.
A variance image is formed by a transformation which includes the mean of the
sub-image. The location of the maximum of this image is taken as the centre of
the optic disc.

The approach in [13] employs clustering techniques with simple thresholding to select several
probable optic disc regions. These regions are clustered into groups and further
analyzed by principal component analysis to identify the optic disc. This
algorithm has yielded robust results in images with large high intensity lesions
such as hard exudates in diabetic retinopathy. Its drawbacks are that it is
time-consuming and its results are not easily reproducible [24].

Optic disc contour detection has been attempted with active contour models 
[10,16] and template matching [8]. Active contour models have failed to detect 
optic disc contour accurately due to the presence of noise, various lesions, in- 
tensity changes close to retinal vessels, and other factors. Various preprocessing 
techniques have been employed to overcome these problems, including morpho- 
logical filtering, pyramid edge detection, etc., but no large scale testing has been 
carried out to validate their accuracies. In this work, we have compared the pro- 
posed algorithm with active contour models to validate its robustness. Template 
matching [8] yields better results because it tends to view the optic disc as a 
whole entity rather than processing at pixel level. However, none of the algo- 
rithms has been tested on a large number of images and proven to be sufficiently 
robust and accurate for clinical use. 



6 Conclusion 

In this paper, we have presented an optic disc detection algorithm that employs
ellipse fitting and wavelet processing to detect the optic disc contour accurately.
Experimental results have shown that the algorithm is capable of achieving 94%
accuracy for the optic disc detection and 93% accuracy for the assessment of
vertical optic disc diameter in 279 consecutive digital retinal images obtained
from patients in a diabetic retinopathy screening program. The assessment of
vertical optic disc diameter, when combined with parameters such as the vertical
optic cup diameter, can provide useful information for the diagnosis and
follow-up management of glaucoma patients.



Abstract. We develop a novel approach to view-invariant recognition
and apply it to the task of recognizing face images under widely sepa- 
rated viewing directions. Our main contribution is a novel object repre- 
sentation scheme using ‘extended fragments’ that enables us to achieve 
a high level of recognition performance and generalization across a wide 
range of viewing conditions. Extended fragments are equivalence classes 
of image fragments that represent informative object parts under dif- 
ferent viewing conditions. They are extracted automatically from short 
video sequences during learning. Using this representation, the scheme is 
unique in its ability to generalize from a single view of a novel object and 
compensate for a significant change in viewing direction without using 
3D information. As a result, novel objects can be recognized from vie- 
wing directions from which they were not seen in the past. Experiments 
demonstrate that the scheme achieves significantly better generalization 
and recognition performance than previously used methods. 



1 Introduction 

View-invariance refers to the ability of a recognition system to identify an object, 
such as a face, from any viewing direction, including directions from which the 
object was not seen in the past. View-invariant recognition is difficult because 
images of the same object viewed from different directions can be highly dissi- 
milar. The challenge of view-invariant recognition is to correctly identify a novel 
object based on a limited number of views, from different viewing directions. For 
example, after seeing a single frontal image of a novel face, the same face has to 
be recognized when seen in profile. 

In the current study we develop a scheme for view-invariant recognition ba- 
sed on the automatic extraction and use of corresponding views of informative 
object parts. The approach has two main components. First, objects within a 
class, such as face images, are represented in terms of common ‘building blocks’, 
or parts. The parts we use are sub-images, or object fragments, selected automa- 
tically from a training set during a learning phase. Second, images of the same 
part under different viewing directions are grouped together to form a generali- 
zed fragment that extends across changes in the viewing direction. (We therefore 
refer to a set of equivalent fragments as an ‘extended’ fragment.) The general
idea is that the equivalence between different object views will be induced by the
learned equivalence of the extended fragments. To achieve view invariance, the 
view of a novel object within a class will be represented in terms of the constitu- 
ent parts. The appearances of the parts themselves under different conditions are 
extracted during learning, and this will be used for recognizing the novel object 
under new viewing conditions. We describe below how such extended fragments 
are extracted from training images, and how they are used for identifying no- 
vel objects seen in one viewing direction, from a different and widely separated 
direction. 

The remainder of the paper is organized as follows. In section 2 we review past 
approaches to view-invariant recognition. In section 3 our extended fragments 
representation is introduced, and in section 4 we describe how this representation 
is incorporated in a recognition scheme. In section 5 we present results obtained 
by our algorithm and compare them to a popular PCA-based approach. We
make additional comparisons and discuss future extensions in section 6. 

2 A Review of Past Approaches 

In this section, we give a brief review of some general approaches to view- 
invariant recognition. 

One possible approach to achieve view invariance is to use features that are 
by themselves invariant to pose transformations. The basic idea is to identify 
image-based measures that remain constant as a function of viewing direction, 
and use them as a signature that identifies an object. One well-known measure 
is the four-point cross-ratio, but other, more complicated algebraic invariants, 
have been proposed [1] . Several types of features invariant under arbitrary affine 
transformations were derived and used for object recognition [2,3], and features 
that are nearly invariant were derived for more general transformations [4] . One 
shortcoming of this approach is that it is difficult to find a sufficient number 
of invariant features for reliable recognition, especially when objects that are 
similar in overall shape (such as faces) have to be discriminated. Second, many 
useful features are not by themselves invariant, and consequently their use is 
excluded in the invariant features framework, in contrast with the extended 
features approach described below . (See also comparisons in section 3.) 

Another general approach is to store multiple views of each object to be 
recognized, and possibly apply some form of view-interpolation for intermediate 
views (e.g. [5]). This approach requires multiple views of each novel object, and 
the interpolation usually requires correspondence between the novel object and a 
stored view. Such correspondence turned out in practice to be a difficult problem. 

Having a full 3D model of an object alleviates the need to store multiple 
views, since novel views may be generated from such a model. However, ob- 
taining precise 3D models in practice is difficult, and usually requires special 
measuring equipment (e.g. [6]). Due to this requirement, recognition using 3D 
data is frequently considered separately from image-based methods. For exam- 
ples of 3D approaches, see [7,8]. 

Several methods (elastic graph matching [9], Active Appearance Models [10])
use flexible matching to deal with the deformation caused by changes in pose.
Since small pose changes tend to leave all features visible and only change the
distances between them, this approach is able to compensate for small (10-15°)
head rotations. A feature common to such approaches is that they easily
compensate for small pose changes, but the performance drops significantly when
larger pose changes (e.g. above 45°) are present.

A popular approach to object recognition in general is based on principal 
components analysis (PCA). When applied to face recognition, this approach is 
known as eigen-faces [11]. Several researchers have used PCA to achieve pose 
invariance. Murase and Nayar [12] acquired images of several objects every four 
degrees. From these images, they constructed an eigenspace representation for 
a given object, and used it for recognizing the object in different poses. A li- 
mitation of this approach is the need to acquire and store a large number of 
views for each object. Pentland et al. [13] developed a similar scheme, applied 
to face images, and using only five views between frontal and profile (inclusive). 
Performance was good when the view to be recognized was at the same orienta- 
tion as the previously seen pictures, but dropped quickly when interpolation or 
extrapolation between views was required. 



3 The Extended Fragments Approach 

Our approach is an extension of object recognition methods in which objects are 
represented using a set of informative sub-images, called fragments or patches 
[14,15,16]. These methods are general and can be applied to a wide variety of 
natural object classes (for example, faces, cars, and animals). In this section we 
describe briefly the relevant aspects of the fragment-based approaches, discuss 
their limitations for view-invariance, and outline our extension based on extended 
fragments. We illustrate the approach using the task of face recognition, but the 
method is general and can be applied to different object classes. 

In fragment-based recognition, informative object fragments are extracted 
during a learning stage. The extraction is based on the measure of mutual in- 
formation between the fragments and the class they represent. A large set of 
candidate fragments is evaluated, and a subset of informative fragments is then 
selected for the recognition process. Informative fragments for face images typi- 
cally include different types of eyes, mouths, hairlines, etc. 

During recognition, this set of fragments is searched for in the target images
using the absolute value of normalized cross-correlation, given by

$$NCC(p, f) = \frac{E[(p - \bar{p})(f - \bar{f})]}{\sigma_p \sigma_f} \qquad (1)$$

Here $f$ is the fragment and $p$ is an image patch of the same size as the fragment;
$\bar{p}$, $\bar{f}$ and $\sigma_p$, $\sigma_f$ denote the means and standard deviations of their pixel values.
Image patches at all locations are evaluated and the one with the highest
correlation is selected. When the correlation exceeds a pre-determined threshold, the
fragment is considered present, or active, in the image. A schematic illustration
of this scheme is presented in Figure 1(a).
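
The detection rule can be sketched in a few lines. The example below uses the standard definition of normalized cross-correlation and an exhaustive scan over all patch positions; the function names `ncc` and `fragment_is_active`, and the convention of returning 0 for a constant patch, are assumptions for illustration and do not describe the original implementation.

```python
import math


def ncc(patch, fragment):
    """Normalized cross-correlation of two equally sized 2D arrays (lists of rows)."""
    p = [v for row in patch for v in row]
    f = [v for row in fragment for v in row]
    n = len(p)
    mean_p, mean_f = sum(p) / n, sum(f) / n
    cov = sum((a - mean_p) * (b - mean_f) for a, b in zip(p, f)) / n
    std_p = math.sqrt(sum((a - mean_p) ** 2 for a in p) / n)
    std_f = math.sqrt(sum((b - mean_f) ** 2 for b in f) / n)
    if std_p == 0 or std_f == 0:
        return 0.0  # constant patch: correlation undefined, treated as no match
    return cov / (std_p * std_f)


def fragment_is_active(image, fragment, threshold):
    """Scan all patch locations; return (active, best |NCC|)."""
    fh, fw = len(fragment), len(fragment[0])
    best = 0.0
    for top in range(len(image) - fh + 1):
        for left in range(len(image[0]) - fw + 1):
            patch = [row[left:left + fw] for row in image[top:top + fh]]
            best = max(best, abs(ncc(patch, fragment)))
    return best > threshold, best
```

Because the measure subtracts the means and divides by the standard deviations, a patch that differs from the fragment only by brightness offset and contrast gain still scores 1.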

Informative fragments have a number of desirable properties [14]. They
provide a compact representation of objects or object classes and can be used for
efficient and accurate recognition. However, fragments used in previous schemes
are not view-invariant. The reason is that objects and object parts look very
different under different orientations. As a result, fragments that were active, e.g.,
in a frontal view of a certain face will not be active in side views of the same
face. Therefore, the representation by active fragments is not view-invariant.
This problem is illustrated in Figure 1(b).

Fig. 1. Extended object fragments. (a) Fragments-based recognition: schematic
illustration of previous fragments-based approaches. Bottom: faces represented in the
system. Top: informative fragments used for the representation. Lines connect each
face to fragments that are present in the face, as computed by normalized
cross-correlation. Novel faces can be detected reliably using a limited set of fragments.
(b) Lack of viewpoint-invariance: informative fragments are not viewpoint-invariant;
for instance, a frontal eye fragment (left) is different from the corresponding
side fragment (right). If the detection threshold is set low enough so that the fragment
will be active in both images, many spurious detections will occur, and the overall
recognition performance will deteriorate. (c) Extended fragments: view invariance is
obtained by introducing equivalence sets of fragments. Fragments depicting the same
face part viewed from different angles are grouped to form an extended fragment. The
eye is detected in an image if either the frontal or the profile eye template is found.
This is indicated by the OR attached to the pair of fragments.

To overcome this problem, we use the fact that the only source of difference
between the left and right images in Figure 1(b) is the difference in viewpoint. The face
itself consists of the same sub-parts in both images, and we therefore wish to 
represent the objects in terms of sub-parts and not in terms of view-specific 
sub-images. The representation using sub-parts will then be view-invariant and 
will allow invariant recognition. 

To represent sub-parts in an invariant manner, one approach has been to
use affine-invariant patches [3,2]. This approach works well in some applications
(such as wide-baseline matching) that use nearly planar surfaces. However, for
non-planar objects, including faces, affine transformations provide a poor
approximation. In our experiments, methods based on affine invariant matching failed
entirely at 45° rotation.

In our scheme, invariance of sub-parts was achieved using multiple templates, 
by grouping together the images of the same object part under different viewing 
conditions. For example, to represent the ‘eye’ part in Figure 1(b), the two sub- 
images of the eye region shown in the figure are grouped together to form an 
‘extended eye fragment’. Using this extended fragment, the eye is detected in an 
image if either the frontal or the side eye template is detected. Typically, the
frontal template would be found in frontal images and the profile template would 
be found in profile images. Consequently, at the level of extended fragments, the 
representation becomes invariant, as illustrated schematically in Figure 1(c). 
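
The OR rule for extended fragments can be written down directly. In the minimal sketch below, each extended fragment is a mapping from a view label to the best template score found in the image (for example, the absolute normalized cross-correlation), and each template has its own threshold. The data layout and function names are hypothetical.

```python
def extended_fragment_active(template_scores, thresholds):
    """An extended fragment is active if ANY of its view templates is active.

    template_scores : dict mapping view label -> best match score in the image
    thresholds      : dict mapping view label -> detection threshold
    """
    return any(template_scores[view] > thresholds[view]
               for view in template_scores)


def represent(image_scores, thresholds):
    """Binary vector over extended fragments for one image."""
    return [int(extended_fragment_active(scores, thr))
            for scores, thr in zip(image_scores, thresholds)]
```

For instance, a frontal image with eye scores `{"frontal": 0.9, "side": 0.3}` and a side image with `{"frontal": 0.2, "side": 0.85}` both activate the extended eye fragment under a 0.7 threshold, so the two views map to the same representation.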

Note that the scheme uses the multiple-template representation only for object
parts, not for entire objects as in previous multiple-template
algorithms [5,12,13]. This has a number of significant advantages over previous
schemes. First, multiple examples of each object are needed only in the training
phase, when extended fragments are created. Since extended fragments are both
view-invariant and capable of representing novel objects of the same class, a
view-invariant representation of novel objects is obtained from a single image. This is
a significant advantage over previous multiple-view schemes, where many views
of each novel object were required. Second, the extended fragments
representation is more efficient in terms of memory than previous multiple-template
schemes, because the templates required for fragment representation are much
smaller than the entire object images. The reduction in space requirements also
reduces matching time, as there is no need to perform matching of a large
collection of full-size images. Finally, object parts generalize better to novel viewing
conditions than entire objects (see section 6). Therefore, fewer templates per
extended fragment will be required to cover a given range of viewing directions
compared with matching images of entire objects.

We have implemented the scheme outlined above and applied it to recognize 
face images from two widely separated viewing directions: frontal and 60° profile, 
called below a ‘side view’. Note that this range is wide enough to undermine
schemes that were not specifically designed for view-invariance, such as [9,10].
As discussed in section 6, generalization of our algorithm to handle any viewing 
direction is straightforward. The following sections describe in more detail the 
different stages in the algorithm. 



3.1 The Extraction of Extended Fragments 

In our training, extended fragments were extracted from images of 100 subjects 
from the FERET database [17]. The images were low-pass filtered and down- 
sampled to size 60 x 40 pixels. 

To form extended fragments, the multiple template representation of object
parts must be obtained. To deal with two separate views, a set of sub-image pairs
must be provided, where in each pair one sub-image will be a view of some face
part in frontal orientation and the other sub-image will be a view of the same
face part in the side view. For this, correspondence must be established between
face areas in frontal and side images. This is a standard task in computer vision
that is similar to optical flow computation.

Fig. 2. Illustration of extended fragments selection. (a) A sample training sequence.
(b) Example fragments: some extended fragments that were selected automatically
from this and similar training sequences. Each pair constitutes an extended fragment;
the left part is the fragment in frontal images, the right part is the same fragment in
side images. Top part: sub-images with fragment shapes delineated. Bottom part: the
same fragments are shown outside the corresponding sub-images.

We used the KLT (Kanade-Lucas-Tomasi) algorithm [18] to automatically 
establish correspondences between frontal and side views. The KLT algorithm 
selects points in the initial image that can be tracked reliably, and uses a simple 
gradient search to follow the selected points in subsequent images. The tracking 
improves when the differences between successive images are small. Therefore, 
using short video segments of rotating faces produces more reliable results. We
used images from the FERET database [17] in the training stage. A
training sequence contained three intermediate views in addition to the frontal
and side views; an example of such a sequence is shown in Figure 2(a).

After tracking was completed, the intermediate views were discarded. The 
correspondences obtained by KLT can then be used to associate together the 
views of the same face part under different poses. This is obtained by selecting 
a sub-image in one view (e.g. frontal), and using the tracked points to identify 
the corresponding sub-image from the other (side) view. The sub-images are 
polygons defined by a subset of matching points. In particular, we have used tri- 
angular sub-images defined by corresponding triplets of points. Matching pairs 
of triangles (one from frontal, another from side view) were grouped together 
and formed the pool of candidate extended fragments. We have also tried to in- 
terpolate between the points tracked by KLT to obtain dense correspondences, 
and use matching regions of arbitrary shape (Figure 2(b)). However, the diffe- 
rence in performance was only marginal. Therefore, triangular fragments were 
used throughout most of our tests. 

3.2 Selection Using Mutual Information and Max-Min Algorithm 

During learning, extended fragments were extracted from the 100 training se- 
quences. The number of all possible fragments was about 100000 per sequence; 
as a result, learning becomes time-consuming. However, most of these candi- 
date fragments are not informative, and it is possible to reduce the size of the 
pool based on fragment size. It was shown in [15] that the most informative 
fragments have intermediate size. In our experiments (see section 6), fragments 
that were smaller than 6% of the object area or larger than 20% were uninfor- 
mative. By excluding from consideration candidate fragments outside these size 
constraints, the number of candidate extended fragments was reduced to about 
1000 per training sequence. Since calculating the fragment’s size is much simpler 
than evaluating its mutual information, significant savings in computation were 
obtained. 

Extracting extended fragments from all 100 sequences results in a pool of can- 
didate fragments of size around 100000. This set still contains many redundant 
or uninformative extended fragments. Therefore, the next stage in the feature 
extraction process is to select a smaller subset of fragments that are most useful 
for the task of recognition. 

This selection was obtained based on maximizing the information supplied 
by the extended fragments for view-invariant recognition. The use of mutual in- 
formation for feature selection is motivated by both theoretical and experimental 
results. Successful classification reduces the initial uncertainty (entropy) about 
the class. The classification error is bounded by the residual entropy (Fano’s 
inequality [19]), and this entropy is minimized when I(C;F), the mutual infor- 
mation between the class and the set F of fragments, is maximal. In practice, 
selecting features based on maximizing mutual information leads to improved 
classification compared with less informative features [14] . We explain below the 
procedure for selecting the most informative extended fragments. 

The first step of the selection procedure derives for each extended fragment 
a measure of mutual information. Mutual information between the class C and 
fragment F is given by 



$$I(C;F) = \sum_{c,f} p(C=c, F=f) \log \frac{p(C=c, F=f)}{p(C=c)\,p(F=f)}. \qquad (2)$$



By measuring the frequencies of detecting F inside different classes c, we can 
evaluate the mutual information of a fragment from the training data. 
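
As an illustration of Eq. (2), the following minimal sketch estimates the mutual information from a table of detection counts. The function name `mutual_information`, the layout of the count table, and the choice of base-2 logarithms are assumptions made for this example; the text does not specify them.

```python
import math


def mutual_information(joint_counts):
    """Estimate I(C;F) in bits from co-occurrence counts.

    joint_counts[c][f] is the number of training images of class c in which
    the fragment took value f (here 0 = not detected, 1 = detected).
    This table layout is a hypothetical convention for the example.
    """
    total = sum(sum(row) for row in joint_counts)
    n_values = len(joint_counts[0])
    # Marginal distributions of the class and of the fragment value.
    p_c = [sum(row) / total for row in joint_counts]
    p_f = [sum(row[f] for row in joint_counts) / total for f in range(n_values)]

    mi = 0.0
    for c, row in enumerate(joint_counts):
        for f, count in enumerate(row):
            if count == 0:
                continue  # the term 0 * log(0) is taken to be 0
            p_cf = count / total
            mi += p_cf * math.log2(p_cf / (p_c[c] * p_f[f]))
    return mi


# A fragment detected in every image of one class and in no image of the
# other carries one full bit about a balanced two-class problem.
informative = mutual_information([[0, 10], [10, 0]])
# A fragment detected equally often in both classes carries no information.
uninformative = mutual_information([[5, 5], [5, 5]])
```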

The next step is to select a bank of $n$ fragments $B = \{F_1, \ldots, F_n\}$ with the 
highest mutual information about the class $C$: $B = \arg\max_B I(C; B)$. Evaluating 
the mutual information with respect to the joint distribution of many variables 
is impractical, therefore some approximation must be used. A natural approach 
is to use greedy iterative optimization. The selection process is initialized by 
selecting the extended fragment $F_1$ with the highest mutual information. Fragments 
are then added one by one, until the gain in mutual information is small, 
or until a limit on the bank size $n$ is reached. To expand a size-$k$ fragment bank 
$B = \{F_1, \ldots, F_k\}$ to size $k+1$, a new fragment $F_{k+1}$ must be selected that will 
add the maximal amount of new information to the bank. The conditional mutual 
information between $F_{k+1}$ and the class given the current fragment bank must 
therefore be maximized: $F_{k+1} = \arg\max I(C; F_{k+1} \mid B)$. Estimating $I(C; F_{k+1} \mid B)$ 
still depends on multiple fragments. The term $I(C; F_{k+1} \mid B)$ can be approximated 
by $\min_{F_j \in B} I(C; F_{k+1} \mid F_j)$. This term contains just two fragments and can 
be computed efficiently from the training data. The approximation essentially 
takes into account correlations between pairs of fragments, but not higher order 
interactions. It makes sure that the new fragment $F_{k+1}$ is informative, and that 
the information it contributes is not contained in any of the previously selected 
fragments. The overall algorithm for selection can be summarized as: 

$$F_1 = \arg\max_{F} I(C; F); \qquad (3)$$

$$F_{k+1} = \arg\max_{F} \min_{i} I(C; F \mid F_i). \qquad (4)$$

The second stage determines the contribution of a fragment F by finding the 
most similar fragment already selected (this is the min stage) and then selects 
the new fragment with the largest contribution (the max stage). The full com- 
putation is therefore called the max-min selection. 
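
The max-min selection of Eqs. (3) and (4) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes each fragment is already binarized into a list of 0/1 activations over the training images, estimates information from empirical frequencies, and stops only on the bank-size limit. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(symbols):
    """Empirical entropy (bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((k / n) * math.log2(k / n) for k in Counter(symbols).values())


def mutual_info(labels, feature):
    """I(C;F) = H(C) + H(F) - H(C,F)."""
    joint = list(zip(labels, feature))
    return entropy(labels) + entropy(feature) - entropy(joint)


def conditional_mutual_info(labels, feature, given):
    """I(C;F|G) = I(C;(F,G)) - I(C;G), which involves only two fragments."""
    pair = list(zip(feature, given))
    return mutual_info(labels, pair) - mutual_info(labels, given)


def max_min_select(labels, features, bank_size):
    """Greedy max-min selection; returns indices of the chosen fragments."""
    remaining = set(range(len(features)))
    # Eq. (3): start with the single most informative fragment.
    first = max(sorted(remaining), key=lambda i: mutual_info(labels, features[i]))
    bank = [first]
    remaining.discard(first)
    while remaining and len(bank) < bank_size:
        def gain(i):
            # Min stage: contribution relative to the most similar bank member.
            return min(conditional_mutual_info(labels, features[i], features[j])
                       for j in bank)
        # Max stage, Eq. (4): pick the fragment with the largest contribution.
        best = max(sorted(remaining), key=gain)
        bank.append(best)
        remaining.discard(best)
    return bank


# Toy data: fragment 1 duplicates fragment 0, fragment 2 is complementary.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
selected = max_min_select(labels, features, bank_size=2)
```

In this toy case the duplicate fragment contributes no new information once fragment 0 is in the bank, so the min stage scores it at zero and the complementary fragment is chosen instead.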




Fig. 3. An example of recognition by extended fragments. (a) Top row: a novel frontal 
face (left image), together with the same face, but at 60° orientation, among distractor 
faces. Only a few examples are shown, the actual testing set always contained 99 
distractors. The side view images are arranged according to their similarity to the 
target image computed by the extended fragments algorithm. The image selected as 
the most similar is the correct answer. Below each distractor, one of the extended 
fragments that helped in the identification task is shown. Each of these fragments was 
detected either only in the frontal face, or only in the distractor side view above the 
fragment, providing evidence that the two faces are different. The numbers next to each 
face show its rank as given by the view-based PCA scheme. (b) Same as (a), without 
the fragments shown. The test faces were frontal in this case and the target face was 
at 60°. 



During recognition, fragment detection is performed by computing the ab- 
solute value of the normalized cross-correlation at every image location, and 
selecting the location with the highest correlation. The maximal correlation, 
which is a continuous value in the range [0,1], can be used in the recognition 
process. We used, however, a simplified scheme in which the feature value was 
binarized. A fragment was considered to be present, and have the value of 1 in a 
given image, if its maximal correlation in the image was above a pre-determined 
threshold. If the maximal correlation was below the threshold, the fragment was 
assigned a value of 0. (Fragments whose activation is above the threshold are also 
called ‘active’ below, and those with activation below the threshold are called 
‘inactive’.) An optimal threshold was selected automatically for each fragment 
in such a way as to maximize the fragment’s mutual information with the class: 
$\theta = \arg\max_\theta I(C; F_\theta)$. Here $F_\theta$ is the fragment $F$ detected with threshold $\theta$. 
Since extended fragments consist of several individual fragments (two in our 
case), each has a separate threshold. These thresholds can be determined by a 
straightforward search procedure. 
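
The detection step described above can be sketched as follows: an exhaustive search for the maximal absolute normalized cross-correlation, followed by binarization. Images are represented as plain nested lists, and the names `ncc` and `detect` are hypothetical; a practical implementation would rely on an optimized correlation routine.

```python
import math


def ncc(patch, template):
    """Normalized cross-correlation of two equally sized 2-D lists."""
    a = [value for row in patch for value in row]
    b = [value for row in template for value in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a constant patch carries no correlation information
    return sum(x * y for x, y in zip(da, db)) / denom


def detect(image, template, threshold):
    """Return (maximal |NCC| over all locations, binarized activation)."""
    t_rows, t_cols = len(template), len(template[0])
    best = 0.0
    for r in range(len(image) - t_rows + 1):
        for c in range(len(image[0]) - t_cols + 1):
            patch = [row[c:c + t_cols] for row in image[r:r + t_rows]]
            best = max(best, abs(ncc(patch, template)))
    # The fragment is 'active' (1) if the maximum exceeds the threshold.
    return best, int(best > threshold)


template = [[1, 2], [3, 4]]
image = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
best, active = detect(image, template, threshold=0.8)
```

The threshold value 0.8 is arbitrary here; in the scheme described above it would be chosen per fragment to maximize the mutual information with the class.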

Examples of extended fragments selected automatically by the algorithm are 
shown in Figure 2(b). 



4 Recognition Using Extended Fragments 

In performing recognition, the system is given a single image of a novel face, 
for example, in frontal view. It is also presented with a gallery of side views of 
different faces. The task is to identify the side view image from the gallery that 
corresponds to the frontal view. 

Given the extended fragments representation, the recognition procedure is 
straightforward. The novel image is represented by the activation pattern of 
the fragment bank. This is a binary vector that specifies which of the extended 
fragments were active in the image. Similarly, activation patterns of the gallery 
images are known. An SVM classifier was used to identify the side activation pattern 
that corresponds to the given frontal activation pattern. An example of the 
scheme applied to target and test images is shown in Figure 3. 

5 Results 

In this section we summarize the results obtained by the method presented above 
and compare them with other methods. The results were obtained as follows. The 
database images were divided into a training and a testing set (several random 
partitions were tried in every experiment). In the training phase, images of 100 
individuals were used. For each individual the data set contained 5 images in 
the orientations shown in Figure 2(a). This set of images was used to select 1000 
extended fragments and their optimal thresholds as described in section 3.2 and 
train the SVM classifier. 

In the testing phase, the algorithm was given a novel frontal view, called the 
target view. (All individuals in the testing and training phases were different.) 
The task was then to identify the side view of the individual shown in the novel 
frontal view. The algorithm was presented with a set (called the ‘test set’) of 100 
side views of different people, one of which was of the same individual shown in 
(a) Recognition performance 



(b) Same as (a), magnified 



Fig. 4. Recognition results and some comparisons. The graph value at X = k shows 
the percentage of trials for which the correct classification was among the top k choices. 
Bars show standard deviation. (a) Comparison of extended fragments with PCA. (b) 
Initial portion of the plot in (a), magnified. 



the frontal view. The algorithm ranked these pictures by their similarity to the 
frontal view using the extended features as described above. When the top ranked 
picture corresponded to the target view, the algorithm correctly recognized the 
individual. 

We present our results using CMC (cumulative match characteristics) curves. 
A CMC curve value at point X = k shows the percentage of trials for which 
the correct match was among the top k matches produced by the algorithm. 
Typically, the interesting region of the curve comprises the several initial points; 
in particular, the point X = 1 on the curve corresponds to the frequency at 
which the correct view was ranked first among the 100 views, i.e. was correctly 
recognized. 
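
For concreteness, a CMC curve can be computed from the rank assigned to the correct match in each trial. The sketch below assumes 1-based ranks and a hypothetical function name `cmc_curve`; the rank values in the example are invented for illustration and are not results from the paper.

```python
def cmc_curve(ranks, gallery_size):
    """Cumulative match characteristic.

    ranks[i] is the 1-based rank of the correct gallery image in trial i.
    The returned list holds, for k = 1..gallery_size, the percentage of
    trials in which the correct match was among the top k candidates.
    """
    n_trials = len(ranks)
    return [100.0 * sum(1 for r in ranks if r <= k) / n_trials
            for k in range(1, gallery_size + 1)]


# Four illustrative trials with a gallery of five images.
curve = cmc_curve([1, 1, 2, 5], gallery_size=5)
recognition_rate = curve[0]  # value at X = 1
```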

The results of our scheme are shown in Figure 4. As can be seen, side views of 
a novel person were identified correctly in about 85% of the cases following the 
presentation of a single frontal image. In order to compare this performance to 
previous schemes, we implemented the view-invariant PCA scheme of Pentland 
et al. [13], which is one of the most successful and widely used face recognition 
approaches. Our implementation was identical to the scheme as described in 
[13]. PCA performance was calculated under exactly the same conditions as used 
for our algorithm, i.e. we trained PCA on the same training images and tested 
recognition on the same images. Figure 4(a) shows the results of the comparison. 
As can be seen from the figure, this method identifies the person correctly in 60% 
of the cases. The plots in Figure 4 show the marked advantage of the present 
algorithm over PCA (the differences are significant at p < 0.01, χ² test). 

A recognized weakness of the PCA method is that it requires precise alig- 
nment of the images. In contrast, our algorithm can tolerate significant errors 
in alignment. We have tested the sensitivity of both algorithms to alignment 
precision. To test the sensitivity of the extended fragments scheme, we fixed one 
Fig. 5. The effect of misalignment on recognition performance. Percent correct recognition 
as a function of misalignment magnitude in pixels for extended fragments and 
view-invariant PCA. 



(frontal) part of each fragment, and shifted the other (side) part in a random 
direction by progressively larger amounts from its correct position. This created 
a controlled error in the location of corresponding fragments. We tested the reco- 
gnition performance as a function of the correspondence error. The results were 
compared with a similar test applied to the PCA method, where the images were 
not precisely aligned, but had a systematic misalignment error. Figure 5 shows 
the performance of the schemes as a function of the amount of misalignment. 
(These tests were performed on a new database and with a smaller test set, the- 
refore the results are not on the same scale as in previous figures.) The task was 
to recognize one out of five faces. Note that for four-pixel shifts, corresponding 
on average to 12% of the face size, PCA performance reduces to chance level. 
In many schemes, image misalignment during learning is a significant potential 
source of errors. As seen in the figure, extended fragments are significantly more 
robust than PCA to misalignments in the learning stage. 



6 Discussion 

We described in this work a general approach to view-invariant object recogni- 
tion. The approach is based on a novel type of features, which are equivalence 
sets of corresponding sub-images, called extended fragments. The features are 
class-based and are applicable to many natural object classes. In particular, we 
have applied the approach to cars and animals with similar results. 

Despite the large number of extended fragments used, time requirements of 
the recognition stage are quite reasonable. In our tests, 1500 recognition at- 
tempts using 1000 fragments took 2-3 seconds (without optimizing the code for 
efficiency). Time requirements of the learning stage are more significant, but 
learning is performed off-line. 

One potential limitation of the extended fragments representation is that due 
to the local nature of the fragments, they might be detected in face images un- 
der inconsistent orientations (e.g. frontal fragment would be detected in profile 
view), which might lead to a decrease in performance. However, in experiments 
where fragments were restricted to be detected only in the relevant orientati- 
ons, no performance improvement was observed. The implication is that reliable 
recognition can be based on the activation of the features themselves, without 
explicitly testing for view consistency between different features. 

In the feature selection process, the size of preferred features was not fixed, 
but free to change within a general range (set in the simulations between 6-20% 
of object size). In other schemes, features are often required to be of a fixed size, 
with some schemes using small local features, and others using global object 
features. It therefore became of interest to test the size of the most informative 
features, and to compare the results with alternative approaches. 

Previous studies [15] have shown that useful fragments are typically of in- 
termediate size. To investigate this further, we tested the effect of feature size 
on view-invariant recognition, using a new database of size 50; 40 images for 
training and 10 for testing. This size of the database allowed selecting exten- 
ded fragments of various sizes and comparing their performance. We found that 
when the extended fragments were of size between 6% and 20% of the object 
area, the test faces were correctly recognized in 89 ± 10% of the cases. When 
the selected fragments were constrained to be smaller (below 3% of the object 
size), the performance dropped to 67 ± 12%, and when the fragments were con- 
strained to be large (above 20%), the performance dropped to 75 ± 10%. Both 
results were highly significant (t test, p < 0.01). This can be contrasted with 
approaches such as [20,21], that use small local features, or [11,13], where the 
features are global. The optimal features extracted were also found to vary in 
size, to represent object features of different dimensions. This can be contrasted 
with approaches such as [22] , where the features are constrained to have a fixed 
size. This flexibility in feature size is important for maximizing the fragments' 
mutual information and classification performance. 

In our testing, features were extracted for two orientations - frontal and 60° 
side view. A complete recognition model should be able, however, to handle a 
full range of orientations (from left to right profile and at different elevations). 
As mentioned above, the scheme can be extended to handle a full range of 
orientations by including views from a set of representative viewing directions 
in each extended fragment. Results of a preliminary experiment suggest that 15 
views are sufficient to cover the entire range of views, or nine views if the bilateral 
symmetry of the face is used. These requirements can be compared with typical 
view interpolation schemes such as [5]. This scheme requires the use of 15 views 
to achieve recognition within a restricted range of −30° to 30° horizontally and 
−20° to 20° vertically. More importantly, it requires all 15 views for each novel 
face. In contrast, the extended fragments scheme requires 15 views for training 
only; in testing, recognition of novel faces is performed from a single view. 

The proposed approach can be extended to handle sources of variability such 
as illumination and facial expression. This can be performed within the general 
framework by adding the necessary templates to each extended fragment. There 
are indications [23] that compensating for illumination changes will be possible 
using a reasonable number of templates. Facial expressions often involve a limited 
area of the face, and therefore affect only a small number of fragments. The full 
size of equivalence sets of extended fragments required to perform unconstrained 
recognition is a subject for future work. 
Abstract. Segmentation and blind restoration are both classical problems 
that are known to be difficult and have attracted major research 
efforts. This paper shows that the two problems are tightly coupled and 
can be successfully solved together. Mutual support of the segmentation 
and blind restoration processes within a joint variational framework is 
theoretically motivated, and validated by successful experimental results. 
The proposed variational method integrates Mumford-Shah segmenta- 
tion with parametric blur-kernel recovery and image deconvolution. The 
functional is formulated using the Γ-convergence approximation and is 
iteratively optimized via the alternate minimization method. While the 
major novelty of this work is in the unified solution of the segmentation 
and blind restoration problems, the important special case of known blur 
is also considered and promising results are obtained. 



1 Introduction 

Image analysis systems usually operate on blurred and noisy images. The stan- 
dard model g = h * f + n is applicable to a large variety of image corruption 
processes that are encountered in practice. Here h represents an (often unknown) 
space-invariant blur kernel (point spread function), n is the noise and f is an 
ideal version of the observed image g. 

The two following problems are at the basis of successful image analysis. 
(1) Can we segment g in agreement with the structure of f? (2) Can we estimate 
the blur kernel h and recover f? Segmentation and blind image restoration are 
both classical problems, that are known to be difficult and have attracted major 
research efforts, see e.g. [16,3,8,9]. 

Had the correct segmentation of the image been known, blind image restora- 
tion would have been facilitated. Clearly, the blur kernel could have then been 
estimated based on the smoothed profiles of the known edges. Furthermore, 
denoising could have been applied to the segments without over-smoothing the 
edges. Conversely, had adequate blind image restoration been accomplished, suc- 
cessful segmentation would have been much easier to achieve. Segmentation and 
blind image restoration are therefore tightly coupled tasks: the solution of either 
problem would become fairly straightforward given that of the other. 
This paper presents an integrated framework for simultaneous segmentation 
and blind restoration. As will be seen, strong arguments exist in favor of con- 
straining the recovered blur kernel to parameterized function classes, see e.g. [4]. 
Our approach is presented in the context of the fundamentally important model 
of isotropic Gaussian blur, parameterized by its (unknown) width. 

2 Fundamentals 

2.1 Segmentation 

The difficulty of image segmentation is well known. Successful segmentation re- 
quires top-down flow of models, concepts and a priori knowledge in addition to 
the image data itself. In their segmentation method, Mumford and Shah [10] 
introduced top-down information via the preference for piecewise-smooth segments 
separated by well-behaved contours. Formally, they proposed to minimize 
a functional that includes a fidelity term, a piecewise-smoothness term, and an 
edge integration term: 

$$\mathcal{F}(f,K) = \frac{1}{2}\int_{\Omega} (f-g)^2\,dA + \beta \int_{\Omega\setminus K} |\nabla f|^2\,dA + \alpha \int_{K} d\sigma \qquad (1)$$

Here $K$ denotes the edge set and $\int_K d\sigma$ is the total edge length. The coefficients 
$\alpha$ and $\beta$ are positive regularization constants. The primary difficulty in the 
minimization process is the presence of the unknown discontinuity set K in 
the integration domains. 

The Γ-convergence framework approximates an irregular functional $\mathcal{F}(f, K)$ 
by a sequence $\mathcal{F}_\epsilon(f, K)$ of regular functionals such that 

$$\lim_{\epsilon \to 0} \mathcal{F}_\epsilon(f, K) = \mathcal{F}(f, K)$$

and the minimizers of $\mathcal{F}_\epsilon$ approximate the minimizers of $\mathcal{F}$. Ambrosio and Tortorelli 
[1] applied this approximation to the Mumford-Shah functional, and represented 
the edge set by a characteristic function $(1 - \chi_K)$ which is approximated 
by an auxiliary function $v$, i.e., $v(x) \approx 0$ if $x \in K$ and $v(x) \approx 1$ otherwise. The 
functional thus takes the form 

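$$\mathcal{F}_\epsilon(f,v) = \frac{1}{2}\int_{\Omega} (f-g)^2\,dA + \beta \int_{\Omega} v^2 |\nabla f|^2\,dA + \alpha \int_{\Omega} \left( \epsilon |\nabla v|^2 + \frac{(v-1)^2}{4\epsilon} \right) dA \qquad (2)$$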

Richardson and Mitter [12] extended this formulation to a wider class of functionals. 
Discretization of the Mumford-Shah functional and its Γ-convergence 
approximation is considered in [5]. Additional perspectives on variational seg- 
mentation can be found in Vese and Chan [19] and in Samson et al [15]. 
Simultaneous segmentation and restoration of a blurred and noisy image has 
recently been presented in [7]. A variant of the Mumford-Shah functional was 
approached from a curve evolution perspective. In that work, the discontinuity 
set is limited to an isolated closed curve in the image and the blurring kernel h 
is assumed to be a priori known. 

2.2 Restoration 

Restoration of a blurred and noisy image is difficult even if the blur kernel h is 
known. Formally, finding $f$ that minimizes 

$$\|h * f - g\|^2_{L^2(\Omega)} \qquad (3)$$

is an ill-posed inverse problem: small perturbations in the data may produce unbounded 
variations in the solution. In Tikhonov regularization [18], a smoothing 
term $\int_\Omega |\nabla f|^2\,dA$ is added to the fidelity functional (3). In image restoration, 
Tikhonov regularization leads to over-smoothing and loss of important edge information. 
For better edge preservation, the Total Variation approach [13,14] 
replaces $L^2$ smoothing by $L^1$ smoothing. The functional to be minimized is thus 

$$\mathcal{F}(f,h) = \frac{1}{2}\|h * f - g\|^2_{L^2(\Omega)} + \beta \int_{\Omega} |\nabla f|\,dA. \qquad (4)$$

This nonlinear optimization problem can be approached via the half-quadratic 
minimization technique [2]. An efficient alternative approach, based on the lagged 
diffusivity fixed point scheme and conjugate gradients iterations, was suggested 
by Vogel and Oman [20]. 

Image restoration becomes even more difficult if the blur kernel h is not 
known in advance. In addition to being ill-posed with respect to the image, 
the blind restoration problem is ill-posed in the kernel as well. To illustrate one 
aspect of this additional ambiguity, suppose that h represents isotropic Gaussian 
blur, with variance $\sigma^2 = 2t$: 

$$h_t = \frac{1}{4\pi t}\, e^{-\frac{x^2+y^2}{4t}}$$

The convolution of two Gaussian kernels is a Gaussian kernel, the variance of 
which is the sum of the two originating variances: 

$$h_{t_1} * h_{t_2} = h_{t_1 + t_2}. \qquad (5)$$

Assume that the true $t$ of the blur kernel is $t = T$, so $g = h_T * f$. The fidelity 
term (3) is obviously minimized by $f$ and $h_T$. However, according to Eq. 5, $g$ 
can also be expressed as 

$$g = h_{t_1} * h_{t_2} * f \quad \forall\, (t_1 + t_2) = T.$$

Therefore, an alternative hypothesis, that the original image was $h_{t_2} * f$ and the 
blur kernel was $h_{t_1}$, minimizes the fidelity term just as well. This exemplifies a 
Fig. 1. Blind image restoration using the method of [6]. Top-left: Original. Top-right: 
Blurred using an isotropic Gaussian kernel ($\sigma = 2.1$). Bottom-left: Recovered image. 
Bottom-right: Reconstructed kernel. 



fundamental ambiguity in the division of the apparent blur between the recovered 
image and the blur kernel, i.e., that the scene itself might be blurred. For 
meaningful image restoration, this hypothesis must be rejected and the largest 
possible blur should be associated with the blur kernel. This can be achieved by 
adding a kernel-smoothness term to the functional. 
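
The additivity property of Eq. (5), which underlies this ambiguity, can be checked numerically. The sketch below is a one-dimensional illustration only (the isotropic two-dimensional kernel is separable, so the same property holds per axis); the function names, the integration step and the integration range are assumptions made for the example.

```python
import math


def gauss(x, t):
    """One-dimensional Gaussian with variance 2t (the 1-D analogue of h_t)."""
    return math.exp(-x * x / (4.0 * t)) / math.sqrt(4.0 * math.pi * t)


def convolve_at(x, t1, t2, half_width=20.0, dx=0.05):
    """Riemann-sum approximation of (h_t1 * h_t2)(x)."""
    n = int(round(half_width / dx))
    return dx * sum(gauss(k * dx, t1) * gauss(x - k * dx, t2)
                    for k in range(-n, n + 1))


# Two different splits of the same total T = 3 give the same overall blur,
# so the fidelity term alone cannot tell them apart.
x = 0.7
split_a = convolve_at(x, 1.0, 2.0)
split_b = convolve_at(x, 2.5, 0.5)
direct = gauss(x, 3.0)
```

Both splits agree with the single kernel of parameter T = 3 to within numerical precision, which is why an additional term is needed to assign the blur to the kernel rather than to the image.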

Blind image restoration with joint recovery of the image and the kernel, and 
regularization of both, was presented by You and Kaveh [21], followed by Chan 
and Wong [6]. Chan and Wong proposed minimizing a functional consisting of 
a fidelity term and total variation ($L^1$ norm) regularization for both the image 
and the kernel: 



$$\mathcal{F}(f,h) = \frac{1}{2}\|h * f - g\|^2_{L^2(\Omega)} + \alpha_1 \int_{\Omega} |\nabla f|\,dA + \alpha_2 \int_{\Omega} |\nabla h|\,dA. \qquad (6)$$

Much can be learned about the blind image restoration problem by studying 
the characteristics and performance of this algorithm. Consider the images shown 
in Fig. 1. An original image (upper left) is degraded by isotropic Gaussian blur 
with $\sigma = 2.1$ (top-right). Applying the algorithm of [6] (with $\alpha_1 = 10^{-4}$ and 
$\alpha_2 = 10^{-4}$) yields a recovered image (bottom-left) and an estimated kernel 
(bottom-right). It can be seen that the identification of the kernel is inadequate, 
and that the image restoration is sensitive to the kernel recovery error. 

To obtain deeper understanding of these phenomena, we plugged the original 
image f and the degraded image g into the functional (6), and carried out 
minimization only with respect to h. The outcome was similar to the kernel 
shown in Fig. 1 (bottom-right). This demonstrates an excessive dependence of 
the recovered kernel on the image characteristics. At the source of this problem is 
the aspiration for general kernel recovery: the algorithm of [6] imposes only mild 
constraints on the shape of the reconstructed kernel. This allows the distribution 
of edge directions in the image to have an influence on the shape of the recovered 
kernel, via the trade-off between the fidelity and kernel smoothness terms. For 
additional insight see Fig. 2. 

Facing the ill-posedness of blind restoration with a general kernel, two ap- 
proaches can be taken. One is to add relevant data; the other is to constrain 
the solution. Recent studies have adopted one of these two approaches, or both. 
In [17], the blind restoration problem is considered within a multichannel fra- 
mework, where several input images can be available. 

In many practical situations, the blurring kernel can be modeled by the 
physics/optics of the imaging device and the set-up. The blurring kernel can then 
be constrained and described as a member in a class of parametric functions. 
This constraint was exploited in the direct blind deconvolution algorithm of [4] . 
In [11], additional relevant data was introduced via learning of similar images 
and the blur kernel was assumed to be Gaussian. 




Fig. 2. Experimental demonstration of the dependence of the recovered kernel on 
the image characteristics in [6]. Each of the two synthetic bar images (top-row) was 
smoothed using an isotropic Gaussian kernel, and forwarded to the blind restoration 
algorithm of [6] (Eq. 6). The respective recovered kernels are shown in the bottom row. 
3 Coupling Segmentation with Blind Restoration 



The observation, discussed in the introduction, that segmentation and blind re- 
storation can be mutually supporting, is the fundamental motivation for this 
work. We present an algorithm, based on functional minimization, that iterati- 
vely alternates between segmentation, blur identification and restoration. 

Concerning the sensitivity of general kernel recovery, we observe that large 
classes of practical imaging problems are compatible with reasonable constraints 
on the blur model. For example, Carasso [4] described the association of the 
Gaussian case with diverse applications such as undersea imaging, nuclear medi- 
cine, computed tomography scanners, and ultrasonic imaging in nondestructive 
testing. 

In this work we integrate Mumford-Shah segmentation with blind deconvolution 
of isotropic Gaussian blur. This is accomplished by extending the Mumford-Shah 
functional and applying the Γ-convergence approximation as described 
in [1]. The observed image $g$ is modeled as $g = h_\sigma * f + n$ where $h_\sigma$ is an isotropic 
Gaussian kernel parameterized by its width $\sigma$, and $n$ is white Gaussian noise. 
The objective functional used is 

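$$\mathcal{F}_\epsilon(f, h_\sigma, v) = \frac{1}{2}\int_{\Omega} (h_\sigma * f - g)^2\,dA + \beta \int_{\Omega} v^2 |\nabla f|^2\,dA + \alpha \int_{\Omega} \left( \epsilon |\nabla v|^2 + \frac{(v-1)^2}{4\epsilon} \right) dA + \gamma \int_{\Omega} |\nabla h_\sigma|^2\,dA \qquad (7)$$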

The functional depends on the functions $f$ (ideal image) and $v$ (edge integration 
map), and on the width parameter $\sigma$ of the blur kernel $h_\sigma$. The first three terms 
are similar to the Γ-convergence formulation of the Mumford-Shah functional, 
as in (2). The difference is in the replacement of $f$ in the fidelity term by the 
degradation model $h_\sigma * f$. The last term stands for the regularization of the 
kernel, necessary to resolve the fundamental ambiguity in the division of the 
apparent blur between the recovered image and the blur kernel. In the sequel 
it is assumed that the image domain $\Omega$ is a rectangle in $\mathbb{R}^2$ and that image 
intensities are normalized to the range [0, 1]. 

Minimization with respect to / and v is carried out using the Euler-Lagrange 
equations (8) and (9). Setting the derivative with respect to $\sigma$ to zero (10) minimizes 
the functional with respect to that parameter. 



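$$\frac{\delta \mathcal{F}_\epsilon}{\delta v} = 2\beta v |\nabla f|^2 + \alpha\,\frac{v-1}{2\epsilon} - 2\epsilon\alpha \nabla^2 v = 0 \qquad (8)$$

$$\frac{\delta \mathcal{F}_\epsilon}{\delta f} = (h_\sigma * f - g) * h_\sigma(-x,-y) - 2\beta\,\mathrm{Div}(v^2 \nabla f) = 0 \qquad (9)$$

$$\frac{\partial \mathcal{F}_\epsilon}{\partial \sigma} = \int_{\Omega} \left[ (h_\sigma * f - g) \left( \frac{\partial h_\sigma}{\partial \sigma} * f \right) + \gamma\,\frac{2 h_\sigma^2 (x^2+y^2)}{\sigma^5} \left( \frac{x^2+y^2}{\sigma^2} - 4 \right) \right] dA = 0 \qquad (10)$$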



Studying the objective functional (7), it can be seen that it is convex and 
lower bounded with respect to any one of the functions $f$, $v$ or $h_\sigma$ if the other two 
functions are fixed. For example, given $v$ and $\sigma$, $\mathcal{F}_\epsilon$ is convex and lower bounded 
in $f$. Therefore, following [6], the alternate minimization (AM) approach can 
be applied: in each step of the iterative procedure we minimize with respect to 
one function and keep the other two fixed. The discretization scheme used was 
CCFD (cell-centered finite difference) [20]. This leads to the following algorithm: 

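Initialization: $f = g$, $\sigma = \varepsilon_1$, $v = 1$, $\sigma_{prev} \gg 1$
while ($|\sigma_{prev} - \sigma| > \varepsilon_2$) repeat

1. Solve the Helmholtz equation for $v$

   \[ \left( 2\beta |\nabla f|^2 + \frac{\alpha}{2\epsilon} - 2\alpha\epsilon \nabla^2 \right) v = \frac{\alpha}{2\epsilon} \]

2. Solve the following linear system for $f$

   \[ (h_\sigma * f - g) * h_\sigma(-x, -y) - 2\beta \, \mathrm{Div}(v^2 \nabla f) = 0 \]

3. Set $\sigma_{prev} = \sigma$, and find $\sigma$ such that (Eq. 10)

   \[ \frac{\partial \mathcal{F}_\epsilon}{\partial \sigma} = 0 \]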
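The structure of the alternate minimization loop can be illustrated on a toy problem. The sketch below does not implement the image functional; it assumes a hypothetical two-variable function $F(a,b) = (ab - g)^2 + \lambda(a^2 + b^2)$, which, like (7), is convex in each variable when the other is fixed, and it uses the same kind of stopping test on the change of one variable between iterations.

```python
def alternate_minimization(g=2.0, lam=0.1, tol=1e-10, max_iter=10_000):
    """Minimize F(a, b) = (a*b - g)^2 + lam*(a^2 + b^2) by alternating
    exact one-variable minimizations (toy stand-in, not the image functional)."""
    a, b = 1.0, 1.0
    iteration = 0
    for iteration in range(1, max_iter + 1):
        a_prev = a
        # For fixed b, F is a convex quadratic in a; dF/da = 0 gives:
        a = g * b / (b * b + lam)
        # For fixed a, F is a convex quadratic in b; dF/db = 0 gives:
        b = g * a / (a * a + lam)
        # Stop when the monitored variable no longer changes noticeably.
        if abs(a_prev - a) <= tol:
            break
    return a, b, iteration

a_opt, b_opt, n_iter = alternate_minimization()
print(a_opt, b_opt, n_iter)
```

For this toy function the iterates approach $a = b = \sqrt{g - \lambda}$. Each half-step solves its subproblem exactly, which mirrors how steps 1 and 2 each solve a linear system while the other unknowns are held fixed.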

Here $\varepsilon_1$ and $\varepsilon_2$ are small positive constants. Both steps 1 and 2 of the algorithm
call for a solution of a system of linear equations. Step 1 was implemented using
the Generalized Minimal Residual (GMRES) algorithm. In step 2, a symmetric
positive definite operator is applied to $f(x,y)$. Implementation was therefore via
the Conjugate Gradients method. In step 3, the derivative of the functional with
respect to $\sigma$ was analytically determined, and its zero crossing was found using
the bisection method. All convolution procedures were performed in the Fourier
Transform domain. The algorithm was implemented in the MATLAB environment.
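Two of the solvers named here are short enough to sketch. The code below is a generic illustration, not the MATLAB implementation: `conjugate_gradient` solves $Ax = b$ for a symmetric positive definite operator supplied as a callable, and `bisect_zero` locates a zero crossing of a scalar function on an interval with a sign change. The 2 × 2 matrix and the cubic are hypothetical test inputs.

```python
def conjugate_gradient(apply_a, b, tol=1e-12, max_iter=100):
    """Solve A x = b, where apply_a(v) returns A v for a symmetric
    positive definite operator A. tol bounds the squared residual norm."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for the initial x = 0
    p = list(r)                       # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old <= tol:
            break
        ap = apply_a(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def bisect_zero(fn, lo, hi, tol=1e-10):
    """Find a zero crossing of fn in [lo, hi]; requires a sign change."""
    f_lo = fn(lo)
    if f_lo * fn(hi) > 0:
        raise ValueError("no sign change on the interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = fn(mid)
        if f_lo * f_mid <= 0:
            hi = mid                  # the crossing lies in the lower half
        else:
            lo, f_lo = mid, f_mid     # the crossing lies in the upper half
    return 0.5 * (lo + hi)

# Hypothetical test inputs.
matrix = [[4.0, 1.0], [1.0, 3.0]]

def apply_matrix(v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in matrix]

solution = conjugate_gradient(apply_matrix, [1.0, 2.0])
root = bisect_zero(lambda s: s ** 3 - 8.0, 0.0, 5.0)
print(solution, root)
```

Passing the operator as a callable reflects the setting of step 2, where the operator involves convolutions and is applied to an image rather than stored as a matrix.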

4 Special Case: Known Blur Kernel 

If the blur kernel is known, the restriction to Gaussian kernels is no longer 
necessary. In this case, the kernel-regularization term in the objective functional 
(7) can be omitted. Consequently, the algorithm can be simplified by skipping 
step 3 and replacing the stopping criterion by a simple convergence measure. 

The resulting algorithm for coupled segmentation and image restoration is
fast, robust and stable. Unlike [7], the discontinuity set is not restricted to
isolated closed contours. Its performance is exemplified in Fig. 3. The top-left image
is a blurred and slightly noisy version of the original 256 × 256 Lena image (not
shown). The blur kernel was a pill-box of radius 3.3. The top-right image is the
reconstruction obtained using the MATLAB Image Processing Toolbox adaptation
of the Lucy-Richardson algorithm (deconvlucy). The bottom-left image is
the outcome of the proposed method; the bottom-right image shows the associated
edge map $v$ determined by the algorithm ($\beta = 1$, $\alpha = 10^{-8}$, $\epsilon = 10^{-5}$,
4 iterations). Computing time was 2 minutes in interpreted MATLAB on a 2 GHz
PC. The superiority of the suggested method is clear.

Fig. 3. The case of a known blur kernel. Top-left: Corrupted image. Top-right:
Restoration using the Lucy-Richardson algorithm (MATLAB: deconvlucy). Bottom-left:
Restoration using the suggested method. Bottom-right: Edge map produced by the
suggested method.



5 Results: Segmentation with Blind Restoration 

Consider the example shown in Fig. 4. The top-left image was obtained by
blurring the original 256 × 256 Lena image (not shown) with a Gaussian kernel
with $\sigma = 2.1$. The proposed method for segmentation and blind restoration was
applied, with $\beta = 1$, $\gamma = 50$, $\alpha = 10^{-7}$ and $\epsilon = 10^{-4}$. The initial value of $\sigma$ was
$\varepsilon_1 = 0.5$ and the convergence tolerance was taken as $\varepsilon_2 = 0.001$. Convergence
was achieved after 24 iterations; the unknown width of the blur kernel was
estimated to be $\sigma = 2.05$, which is in pleasing agreement with its true value
(a relative error of about 2.4%). The reconstructed image is shown top-right, and
the associated edge map is presented at the bottom of Fig. 4. Compare to Fig. 1.

The top-left image in Fig. 5 is a 200 × 200 gray level image, synthetically
blurred by an isotropic Gaussian kernel ($\sigma = 2.1$) and corrupted by additive white
Gaussian noise (SNR = 44 dB). Restoration using [6] ($\alpha_1 = 10^{-4}$, $\alpha_2 = 10^{-4}$) is shown
top-right. The reconstruction using the method suggested in this paper ($\beta = 1$,
$\alpha = 10^{-6}$, $\gamma = 20$, $\epsilon = 10^{-3}$) is shown bottom-left, and the associated edge
map $v$ is shown bottom-right. The number of iterations to convergence was 18,
and the estimated width of the blur kernel was 1.9. The convergence process is
illustrated in Fig. 6.

Fig. 4. Segmentation and blind restoration. Top-left: Blurred image ($\sigma = 2.1$).
Top-right: Blind reconstruction using the proposed algorithm. Bottom: Edge map $v$
produced by the suggested method. Compare to Fig. 1.



The last example refers to actual defocus blur. The top-left image in Fig. 7
is a 200 × 200 rescaled part of an image obtained with deliberate defocus blur,
using a Canon VC-C1 PTZ video communication camera and SGI O2 analog
video acquisition hardware. The shape and size of the actual defocus blur were
not known to us; they certainly deviate from the isotropic Gaussian model.
The top-right image in Fig. 7 shows the blind restoration result of [6]
($\alpha_1 = 10^{-4}$, $\alpha_2 = 10^{-6}$). At the bottom-left is the reconstruction using the method
suggested in this paper ($\beta = 1$, $\alpha = 10^{-2}$, $\gamma = 100$, $\epsilon = 0.1$). Convergence was
achieved within 7 iterations, and the blur kernel width was estimated to be 1.68.
The edge map $v$ is shown bottom-right. The quality of this result demonstrates
the applicability of the proposed method to real images and its robustness to
reasonable deviations from the Gaussian case. In all our experiments $\beta = 1$,
and the best value of $\gamma$, controlling the deconvolution level, was in the range
$20 \le \gamma \le 100$. The values of $\epsilon$ and $\alpha$ had to be increased in the presence of noise.

6 Discussion 

This paper validates the hypothesis that the challenging tasks of image
segmentation and blind restoration are tightly coupled. Mutual support of the
segmentation and blind restoration processes within an integrative framework is
demonstrated.

Fig. 5. Segmentation and blind restoration. Top-left: Blurred ($\sigma = 2.1$) image with
slight additive noise. Top-right: Restoration using the method of [6]. Bottom-left:
Restoration using the suggested method. Bottom-right: Edge map produced by the
suggested method.

Fig. 6. The convergence of the estimated width $\sigma$ of the blur kernel as a function of
the iteration number in the blind recovery of the coin image.

Fig. 7. Segmentation and blind restoration of unknown defocus blur. Top-left: Blurred
image. Top-right: Restoration using the method of [6]. Bottom-left: Restoration using
the suggested method. Bottom-right: Edge map produced by the suggested method.

Inverse problems in image analysis are difficult and often ill-posed. This
means that searching for the solution in the largest possible space is not always
the best strategy. A-priori knowledge should be used, wherever possible, to limit
the search and constrain the solution. In the context of pure blind restoration,
Carasso [4] analyzes previous approaches and presents convincing arguments in
favor of restricting the class of blurs.

Along these lines, in this paper the blur kernels are constrained to the class of
isotropic Gaussians parameterized by their width. This is a sound approximation
of physical kernels encountered in diverse contexts [4]. The advantages brought
by this restriction are well demonstrated in the experimental results that we
provide. We plan to extend this approach to parametric kernel classes for which
the Gaussian approximation is inadequate, in particular, motion blur.

While the major novelty in this work is in the unified solution of the
segmentation and blind restoration problems, we have obtained valuable results also in
the case of known blur, see Fig. 3 (bottom-left). Note that if the blur is known,
the restriction to the Gaussian case is no longer necessary.

Abstract. In this paper, we present a vision system for object recognition in aerial
images, which enables broader mission profiles for Micro Air Vehicles (MAVs).
The most important factors that inform our design choices are: real-time
constraints, robustness to video noise, and complexity of object appearances. As such,
we first propose the HSI color space and the Complex Wavelet Transform (CWT)
as a set of sufficiently discriminating features. For each feature, we then build
tree-structured belief networks (TSBNs) as our underlying statistical models of
object appearances. To perform object recognition, we develop the novel multiscale
Viterbi classification (MSVC) algorithm, as an improvement to multiscale
Bayesian classification (MSBC). Next, we show how to globally optimize MSVC
with respect to the feature set, using an adaptive feature selection algorithm.
Finally, we discuss context-based object recognition, where visual contexts help to
disambiguate the identity of an object despite the relative poverty of scene detail
in flight images, and obviate the need for an exhaustive search of objects over
various scales and locations in the image. Experimental results show that the
proposed system achieves smaller classification error and fewer false positives than
systems using the MSBC paradigm on challenging real-world test images.



1 Introduction 

We seek to improve our existing vision system for Micro Air Vehicles (MAVs) [1, 2, 3]
to enable more intelligent MAV mission profiles, such as remote traffic surveillance
and moving-object tracking. Given many uncertain factors, including variable lighting
and weather conditions, changing landscape and scenery, and the time-varying on-board
camera pose with respect to the ground, object recognition in aerial images is a
challenging problem even for the human eye. Therefore, we resort to a probabilistic formulation
of the problem, where careful attention must be paid to selecting sufficiently
discriminating features and a sufficiently expressive modeling framework. More importantly,
real-time constraints and robustness to video noise are critical factors that inform the
design choices for our MAV application.

Having experimented with color and texture features [3], we conclude that both color
and texture cues are generally required to accurately discriminate object appearances.
As such, we employ both the HSI color space, for color representation, and also the
Complex Wavelet Transform (CWT), for multi-scale texture representation. In some
cases, where objects exhibit easy-to-classify appearances, the full proposed feature set is
not justifiable in light of real-time processing constraints. Therefore, herein, we propose
an algorithm for selecting an optimal feature subspace from the given HSI and CWT
feature space that considers both correctness of classification and computational cost.

Given this feature set, we then choose tree-structured belief networks (TSBNs) [4],
as underlying statistical models to describe pixel neighborhoods in an image at varying
scales. We build TSBNs for both color and wavelet features, using Pearl’s message
passing scheme [5] and the EM algorithm [6]. Having trained TSBNs, we then proceed
with supervised object recognition. In our approach, we exploit the idea of visual contexts
[7], where initial identification of the overall type of scene facilitates recognition of
specific objects/structures within the scene. Objects (e.g., cars, buildings), the locations
where objects are detected (e.g., road, meadow), and the category of locations (e.g., sky,
ground) form a taxonomic hierarchy. Thus, object recognition in our approach consists of
the following steps. First, sky/ground regions in the image are identified. Second, pixels
in the ground region¹ are labeled using the learned TSBNs for predefined locations (e.g.,
road, forest). Finally, pixels of the detected locations of interest are labeled using the
learned TSBNs for a set of predefined objects (e.g., car, house).

To reduce classification error (e.g., “blocky segmentation”), which arises from the 
fixed-tree structure of TSBNs, we develop the novel multiscale Viterbi classification 
(MSVC) algorithm, an improved version of multiscale Bayesian classification (MSBC) 
[8,9]. In the MSBC approach, image labeling is formulated as Bayesian classification 
at each scale of the tree model, separately; next, transition probabilities between nodes 
at different scales are learned using the greedy classification-tree algorithm, averaging 
values over all nodes and over all scales; finally, it is assumed that labels at a “coarse 
enough” scale of the tree model are statistically independent. On the other hand, in our 
MSVC formulation, we perform Bayesian classification only at the finest scale, fusing 
downward the contributions of all the nodes at all scales in the tree; next, transition 
probabilities between nodes at different scales are learned as histogram distributions 
that are not averaged over all scales; finally, we assume dependent class labels at the 
coarsest layer of the tree model, whose distribution we again estimate as a histogram 
distribution. 



2 Feature Space 

Our feature selection is largely guided by extensive experimentation reported in our prior 
work [3], where we sought a feature space, which spans both color and texture domains, 
and whose extraction meets our tight real-time constraints. 

We obtained the best classification results when color was represented in the HSI
color space. Tests suggested that hue (H), intensity (I) and saturation (S) features were
more discriminative, when compared to the inherently highly correlated features of the
RGB and other color systems [10]. Also, first-order HSI statistics proved to be sufficient
and better than the first and higher-order statistics of other color systems.
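For concreteness, one common RGB-to-HSI conversion is sketched below. The text does not state which HSI definition was used, so the textbook geometric formulation here is an assumption, and the function name `rgb_to_hsi` is illustrative.

```python
import math

def rgb_to_hsi(r, g, b):
    """Convert RGB components in [0, 1] to (hue in degrees, saturation, intensity).
    Uses the common geometric HSI definition (an assumed convention)."""
    intensity = (r + g + b) / 3.0
    if intensity == 0.0:
        return 0.0, 0.0, 0.0                  # black: hue and saturation undefined
    saturation = 1.0 - min(r, g, b) / intensity
    numerator = 0.5 * ((r - g) + (r - b))
    denominator = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if denominator == 0.0:
        return 0.0, saturation, intensity     # gray: hue undefined, reported as 0
    cos_h = max(-1.0, min(1.0, numerator / denominator))  # guard against rounding
    hue = math.degrees(math.acos(cos_h))
    if b > g:
        hue = 360.0 - hue                     # lower half of the hue circle
    return hue, saturation, intensity

print(rgb_to_hsi(1.0, 0.0, 0.0))
print(rgb_to_hsi(0.0, 0.0, 1.0))
```

Unlike the R, G and B channels, which all rise and fall together with brightness, the three HSI components separate chromatic content from intensity, which is the decorrelation property the paragraph refers to.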

For texture-feature extraction, we considered several filtering, model-based and
statistical methods. Our conclusion agrees with the comparative study of Randen et al. [11],
which suggests that for problems where many textures with subtle spectral differences
occur, as in our case, it is reasonable to assume that spectral decomposition by a filter
bank yields consistently superior results over other texture analysis methods. Our
experimental results also suggest that it is necessary to analyze both local and regional
properties of texture. Most importantly, we concluded that a prospective texture analysis
tool must have high directional selectivity. As such, we employ the complex wavelet
transform (CWT), due to its inherent representation of texture at different scales,
orientations and locations [12]. The CWT’s directional selectivity is encoded in six bandpass
subimages of complex coefficients at each level, coefficients that are strongly oriented
at angles ±15°, ±45°, ±75°. Moreover, CWT coefficient magnitudes exhibit the
following properties [13, 14]: i) multi-resolutional representation, ii) clustering, and iii)
persistence (i.e. propagation of large/small values through scales).

¹ Recognition of objects in the sky region can be easily incorporated into the algorithm.

Computing CWT coefficients at all scales and forming a pyramid structure from HSI 
values, where coarser scales are computed as the mean of the corresponding children, we 
obtain nine feature trees. These feature structures naturally give rise to TSBN statistical 
models. 
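The pyramid construction for the HSI channels can be sketched as follows. This is a minimal illustration for one channel, assuming a square image whose side is a power of two; `mean_pyramid` is an illustrative name.

```python
def mean_pyramid(image):
    """Return the pyramid levels, finest first. Each coarser value is the
    mean of its four children, i.e. its 2 x 2 block at the finer level."""
    levels = [image]
    while len(levels[-1]) > 1:
        fine = levels[-1]
        half = len(fine) // 2
        coarse = [
            [
                (fine[2 * r][2 * c] + fine[2 * r][2 * c + 1]
                 + fine[2 * r + 1][2 * c] + fine[2 * r + 1][2 * c + 1]) / 4.0
                for c in range(half)
            ]
            for r in range(half)
        ]
        levels.append(coarse)
    return levels

# A 4 x 4 toy channel with values 0..15.
channel = [[4 * r + c for c in range(4)] for r in range(4)]
pyramid = mean_pyramid(channel)
print(pyramid[1], pyramid[2])
```

Applying this to each of the H, S and I channels gives three trees; together with the six oriented CWT subband trees this accounts for the nine feature trees mentioned in the text.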



3 Tree-Structured Belief Networks



So far, two main types of prior models have been investigated in the statistical image
modeling literature - namely, noncausal and causal Markov random fields (MRF). The
most commonly used MRF model is the tree-structured belief network (TSBN) [15,
14, 8, 9, 16]. A TSBN is a generative model comprising hidden, $X$, and observable,
$Y$, random variables (RVs) organized in a tree structure. The edges between nodes,
representing $X$, encode Markovian dependencies across scales, whereas $Y$’s are assumed
mutually independent given the corresponding $X$’s, as depicted in Figure 1. Herein,
we enable input of observable information, $Y$, also to higher level nodes, preserving
the tree dependences among hidden variables. Thus, $Y$ at the lower layers inform the
belief network on the statistics of smaller groups of neighboring pixels (at the lowest
level, one pixel), whereas $Y$ at higher layers represent the statistics of larger areas in
the image. Hence, we enforce the nodes of a tree model to represent image details
at various scales.² Furthermore, we assume that features are mutually independent,
which is reasonable given that wavelets span the feature space using orthogonal basis
functions. Thus, our overall statistical model consists of nine mutually independent trees
$T_f$, $f \in \mathcal{F} = \{\pm 15^\circ, \pm 45^\circ, \pm 75^\circ, H, S, I\}$.

Fig. 1. Differences in TSBN models: (a) observable variables at the lowest layer only; (b) our
approach: observable variables at all layers. Black nodes denote observable variables and white
nodes represent hidden random variables connected in a tree structure.

Towards Intelligent Mission Profiles of Micro Air Vehicles

In supervised learning problems, as is our case, a hidden RV, assigned to a tree
node $i$, $i \in T_f$, represents a pixel label, $k$, which takes values in a pre-defined set of
image classes, $C$. The state of node $i$ is conditioned on the state of its parent $j$ and is
specified by conditional probability tables, $P_{ij}^{kl}$, $\forall i,j \in T_f$, $\forall k,l \in C$. It follows that
the joint probability of all hidden RVs, $X = \{x_i\}$, can be expressed as

\[ P(X) = \prod_{i,j \in T_f} \prod_{k,l \in C} P_{ij}^{kl} . \tag{1} \]

We assume that the distribution of an observable RV, $y_i$, depends solely on the node
state, $x_i$. Consequently, the joint pdf of $Y = \{y_i\}$ is expressed as

\[ P(Y|X) = \prod_{i \in T_f} \prod_{k \in C} p(y_i | x_i = k, \theta_i^k) , \tag{2} \]

where $p(y_i | x_i = k, \theta_i^k)$ is modeled as a mixture of $M$ Gaussians,³ whose parameters
are grouped in $\theta_i^k$. In order to avoid the risk of overfitting the model, we assume that
the $\theta$’s are equal for all $i$ at the same scale. Therefore, we simplify the notation as
$p(y_i | x_i = k, \theta_i^k) = p(y_i | x_i)$. Thus, a TSBN is fully specified by the joint distribution of $X$
and $Y$ given by

\[ P(X, Y) = \prod_{i,j \in T_f} \prod_{k,l \in C} p(y_i | x_i) \, P_{ij}^{kl} . \tag{3} \]

Now, to perform pixel labeling, we face the probabilistic inference problem of
computing the conditional probability $P(X|Y)$. In the graphical-models literature, the
best-known inference algorithm for TSBNs is Pearl’s message passing scheme [5, 18]; similar
algorithms have been proposed in the image-processing literature [15, 8, 14]. Essentially,
all these algorithms perform belief propagation up and down the tree, where after a
number of training cycles, we obtain all the tree parameters necessary to compute $P(X|Y)$.
Note that, simultaneously with Pearl’s belief propagation, we employ the EM
algorithm [6] to learn the parameters of Gaussian-mixture distributions. Since our TSBNs
have observable variables at all tree levels, the EM algorithm is naturally performed at
all scales. Finally, having trained TSBNs for a set of image classes, we proceed with
multiscale image classification.
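The upward half of belief propagation on a tree can be sketched compactly. The example below is a simplified illustration, not the training procedure of the paper: it assumes one conditional probability table shared by all edges, fixed per-node likelihoods $p(y_i | x_i = k)$ at every node (observables at all layers), and a small hypothetical tree. It computes the posterior of the root label and checks it against brute-force enumeration.

```python
from itertools import product

def upward_messages(node, children, likelihood, cpt, n_classes):
    """beta[k] = p(all observations in the subtree of node | x_node = k)."""
    beta = list(likelihood[node])
    for child in children.get(node, []):
        child_beta = upward_messages(child, children, likelihood, cpt, n_classes)
        for k in range(n_classes):
            # Sum over the child's label m, weighted by P(child = m | parent = k).
            beta[k] *= sum(cpt[m][k] * child_beta[m] for m in range(n_classes))
    return beta

def root_posterior(root, children, likelihood, cpt, prior):
    """Posterior P(x_root = k | all observations) from one upward pass."""
    n_classes = len(prior)
    beta = upward_messages(root, children, likelihood, cpt, n_classes)
    joint = [prior[k] * beta[k] for k in range(n_classes)]
    total = sum(joint)
    return [value / total for value in joint]

def brute_force_root_posterior(root, children, likelihood, cpt, prior):
    """Enumerate every labeling; feasible only for tiny trees (used as a check)."""
    nodes = sorted(likelihood)
    parent = {c: p for p, cs in children.items() for c in cs}
    n_classes = len(prior)
    scores = [0.0] * n_classes
    for labels in product(range(n_classes), repeat=len(nodes)):
        x = dict(zip(nodes, labels))
        weight = prior[x[root]]
        for node in nodes:
            weight *= likelihood[node][x[node]]
            if node != root:
                weight *= cpt[x[node]][x[parent[node]]]
        scores[x[root]] += weight
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical tree: node 0 is the root with children 1 and 2; node 1 has children 3 and 4.
children = {0: [1, 2], 1: [3, 4]}
likelihood = {0: [0.6, 0.4], 1: [0.7, 0.3], 2: [0.2, 0.8], 3: [0.9, 0.1], 4: [0.5, 0.5]}
cpt = [[0.8, 0.3], [0.2, 0.7]]        # cpt[m][k] = P(child = m | parent = k)
prior = [0.5, 0.5]
posterior = root_posterior(0, children, likelihood, cpt, prior)
reference = brute_force_root_posterior(0, children, likelihood, cpt, prior)
print(posterior, reference)
```

The upward pass visits each node once, which is the source of the linear-time inference that makes tree models attractive under real-time constraints.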

4 Multiscale Viterbi Classification 

Image labeling with TSBNs is characterized by “blocky segmentations,” due to their
fixed-tree structure. Recently, several approaches have been reported to alleviate this
problem (e.g., [19, 20]), albeit at prohibitively increased computational cost. Given the
real-time requirements for our MAV application, these approaches are not realizable, and
the TSBN framework remains attractive in light of its linear-time inference algorithms.
As such, we resort to our novel multiscale Viterbi classification (MSVC) algorithm to
reduce classification error instead.

² This approach is more usual in the image processing community [8, 14, 9].

³ For large M, a Gaussian-mixture density can approximate any probability density [17].

Denoting all hidden RVs at the leaf level $L$ as $X^L$, classification at the finest scale
is performed according to the MAP rule

\[ \hat{X}^L = \arg\max_{X^L} \{ P(Y|X) P(X) \} = \arg\max_{X^L} g^L . \tag{4} \]

Assuming that the class label, $x_i^\ell$, of node $i$ at scale $\ell$, completely determines the
distribution of $y_i^\ell$, it follows that:

\[ P(Y|X) = \prod_{\ell=L}^{0} \prod_{i \in \ell} p(y_i^\ell | x_i^\ell) , \tag{5} \]

where $p(y_i^\ell | x_i^\ell)$ is a mixture of Gaussians, learned using the inference algorithms
discussed in Section 3. As is customary for TSBNs, the distribution of $X^\ell$ is completely
determined by $X^{\ell-1}$ at the coarser $\ell - 1$ scale. However, while, for training, we build
TSBNs where each node has only one parent, here, for classification, we introduce a
new multiscale structure where we allow nodes to have more than one parent. Thus,
in our approach to image classification, we account for horizontal statistical
dependencies among nodes at the same level, as depicted in Figure 2. The new multiresolution
model accounts for all the nodes in the trained TSBN, except that it no longer forms a
tree structure; hence, it becomes necessary to learn new conditional probability tables
corresponding to the new edges. In general, the Markov chain rule reads:

\[ P(X) = \prod_{\ell=L}^{0} \prod_{i \in \ell} P(x_i^\ell | X^{\ell-1}) . \tag{6} \]

Fig. 2. Horizontal dependences among nodes at the same level are modeled by vertical dependences
of each node on more than one parent.

The conditional probability $P(x_i^\ell | X^{\ell-1})$ in (6), unknown in general, would require a
prohibitive amount of data to estimate. To overcome this problem, we consider, for each
node $i$, a 3 × 3 box encompassing parent nodes that neighbor the initial parent $j$ of $i$ in
the quad-tree. The statistical dependence of $i$ on other nodes at the next coarser scale,
in most cases, can be neglected. Thus, we assume that a nine-dimensional vector $v_j^{\ell-1}$,
containing nine parents, represents a reliable source of information on the distribution
of all class labels $X^{\ell-1}$ for child node $i$ at level $\ell$. Given this assumption, we rewrite
expression (6) as

\[ P(X) = \prod_{\ell=L}^{0} \prod_{i \in \ell} \prod_{j \in \ell-1} P(x_i^\ell | v_j^{\ell-1}) . \tag{7} \]

Now, we can express the discriminant function in (4) in a more convenient form as

\[ g_f^L = \prod_{\ell=L}^{0} \prod_{i \in \ell} \prod_{j \in \ell-1} p(y_i^\ell | x_i^\ell) \, P(x_i^\ell | v_j^{\ell-1}) . \tag{8} \]

Assuming that our features $f \in \mathcal{F}$ are mutually independent, the overall maximum
discriminant function can therefore be computed as

\[ g^L = \prod_{f \in \mathcal{F}} g_f^L . \tag{9} \]

The unknown transition probabilities $P(x_i^\ell | v_j^{\ell-1})$ can be learned through vector
quantization [21], together with Pearl’s message passing scheme. After the prior
probabilities of class labels of nodes at all tree levels are learned using Pearl’s belief
propagation, we proceed with instantiation of random vectors $v_j^\ell$. For each tree level, we obtain
a data set of nine-dimensional vectors, which we augment with the class label of the
corresponding child node. Finally, we perform vector quantization over the augmented
ten-dimensional vectors. The learned histogram distributions represent estimates of the
conditional probability tables. Clearly, to estimate the distribution of a ten-dimensional
random vector it is necessary to provide a sufficient number of training images, which
is readily available from recorded MAV-flight video sequences. Moreover, since we are
not constrained by the same real-time constraints during training as during flight, the
proposed learning procedure results in very accurate estimates, as is demonstrated in
Section 7.
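A histogram estimate of such a conditional table can be sketched as follows. This is a deliberate simplification: it assumes the parent vectors already consist of discrete class labels, so exact counting replaces the vector quantization step used in the paper, and it adds Laplace smoothing (an assumption) so that unseen child labels keep nonzero probability. The three-element parent tuples in the example stand in for the nine-dimensional vectors.

```python
from collections import Counter, defaultdict

def estimate_transition_table(samples, n_classes, smoothing=1.0):
    """samples: iterable of (parent_labels, child_label) pairs, where
    parent_labels is a tuple of discrete labels.
    Returns table[parent_labels][k], an estimate of P(child = k | parent_labels)."""
    counts = defaultdict(Counter)
    for parent_labels, child_label in samples:
        counts[tuple(parent_labels)][child_label] += 1
    table = {}
    for parent_labels, counter in counts.items():
        total = sum(counter.values()) + smoothing * n_classes
        table[parent_labels] = [
            (counter[k] + smoothing) / total for k in range(n_classes)
        ]
    return table

# Hypothetical training pairs with short parent tuples.
samples = (
    [((0, 0, 1), 0)] * 3
    + [((0, 0, 1), 1)]
    + [((1, 1, 1), 1)] * 2
)
table = estimate_transition_table(samples, n_classes=2)
print(table)
```

With nine parents and several classes the number of distinct parent configurations grows quickly, which is why the text stresses the need for a large training set.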

The estimated transition probabilities $P(x_i^\ell | v_j^{\ell-1})$ enable classification from scale
to scale in Viterbi fashion. Starting from the highest level downwards, at each scale,
we maximize the discriminant function $g^L$ along paths that connect parent and children
nodes. From expressions (8) and (9), it follows that image labeling is carried out as

\[ \hat{x}_i^\ell = \arg\max_{x_i^\ell \in C} \prod_{f \in \mathcal{F}} \prod_{\ell=L}^{0} \prod_{i \in \ell} \prod_{j \in \ell-1} p(y_{f,i}^\ell | x_i^\ell) \, P(x_i^\ell | \hat{v}_j^{\ell-1}) , \tag{10} \]

where $\hat{v}_j^{\ell-1}$ is determined from the previously optimized class labels of the coarser scale
$\ell - 1$.
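The coarse-to-fine flow of this labeling can be illustrated with a much simplified sketch. It is not the MSVC algorithm itself: it assumes a 1-D dyadic pyramid instead of a quad-tree, a neighborhood of up to three parents instead of nine, a single feature, a greedy per-node maximization, and a toy transition model (`vote_transition`) in place of the learned tables. All names are illustrative.

```python
def vote_transition(k, context, stay=0.8, n_classes=2):
    """Toy P(child = k | parent context): mixes the parents' label frequencies
    with a uniform distribution (assumed stand-in for learned tables)."""
    agree = sum(1 for label in context if label == k) / len(context)
    return stay * agree + (1.0 - stay) / n_classes

def label_coarse_to_fine(likelihoods, prior, transition=vote_transition):
    """likelihoods[level][i][k] = p(y_i | x_i = k). Level 0 is the root and
    level l holds 2**l nodes of a 1-D dyadic pyramid."""
    n_classes = len(prior)
    root = likelihoods[0][0]
    labels = [[max(range(n_classes), key=lambda k: prior[k] * root[k])]]
    for level in range(1, len(likelihoods)):
        coarse = labels[-1]               # labels already fixed at the coarser scale
        current = []
        for i, lik in enumerate(likelihoods[level]):
            j = i // 2                    # initial parent of node i
            context = [coarse[p] for p in (j - 1, j, j + 1) if 0 <= p < len(coarse)]
            current.append(max(
                range(n_classes),
                key=lambda k: lik[k] * transition(k, context),
            ))
        labels.append(current)
    return labels

likelihoods = [
    [[0.5, 0.5]],
    [[0.5, 0.5], [0.4, 0.6]],
    [[0.3, 0.7], [0.02, 0.98], [0.6, 0.4], [0.3, 0.7]],
]
labels = label_coarse_to_fine(likelihoods, prior=[0.6, 0.4])
print(labels)
```

In this toy run, weak evidence at a fine node is overridden by the labels of its parents, while sufficiently strong evidence still wins; this is the smoothing effect that the multi-parent structure is meant to provide.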

5 Adaptive Feature Selection 

We have already pointed out that in some cases, where image classes exhibit favorable
properties, there is no need to compute expression (10) over all features. Below, we
present our algorithm for adaptive selection of the optimal feature set, $\mathcal{F}_{sel}$, from the
initial feature set, $\mathcal{F}$.

1. Form a new empty set $\mathcal{F}_{sel} = \{\emptyset\}$; assign $g_{new} = 1$, $g_{old} = 0$;

2. Compute $g_f^L$, $\forall f \in \mathcal{F}$, given by (8) for $\hat{x}_i^\ell$ given by (10);

3. Move the best feature $f^*$, for which $g_{f^*}^L$ is maximum, from $\mathcal{F}$ to $\mathcal{F}_{sel}$;

4. Assign $g_{new} = \prod_{f \in \mathcal{F}_{sel}} g_f^L$;

5. If ($g_{new} < g_{old}$) delete $f^*$ from $\mathcal{F}_{sel}$ and go to step 3;

6. Assign $g_{old} = g_{new}$;

7. If ($\mathcal{F} \neq \{\emptyset\}$) go to step 3;

8. Exit and segment the image using features in $\mathcal{F}_{sel}$.

The discriminant function, g, is nonnegative; hence, the above algorithm finds at least 
one optimal feature. Clearly, the optimization criteria above consider both correctness 
of classification and computational cost. 
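The selection loop can be written down directly. The sketch below assumes the per-feature discriminant values $g_f$ have already been computed (step 2) and are passed in as a dictionary; the numeric scores in the example are hypothetical. A guard against an empty remaining set is added when returning to step 3, which the listing leaves implicit.

```python
def select_features(scores):
    """scores maps each feature name to its discriminant value g_f
    (assumed precomputed). Returns the selected features in selection order."""
    remaining = dict(scores)              # the initial feature set
    selected = []                         # step 1: empty selected set
    g_old = 0.0
    while remaining:                      # step 7: continue while features remain
        best = max(remaining, key=remaining.get)   # step 3: best remaining feature
        g_new = remaining.pop(best)                # step 4: product over the
        for name in selected:                      # tentatively enlarged set
            g_new *= scores[name]
        if g_new < g_old:                 # step 5: reject and try the next feature
            continue
        selected.append(best)
        g_old = g_new                     # step 6
    return selected

# Hypothetical discriminant values for four features.
print(select_features({"H": 2.0, "S": 0.5, "+45": 1.5, "I": 0.9}))
```

Because $g_{old}$ starts at zero and the discriminant is nonnegative, the first candidate is always accepted, which matches the remark that at least one feature is found. In practice such products of many probabilities become very small, so an implementation would likely work with logarithms; that is a design suggestion rather than something stated in the text.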

6 Object Recognition Using Visual Contexts 

In our approach to object recognition, we seek to exploit the idea of visual contexts 
[7]. Having previously identified the overall type of scene, we can then proceed to 
recognize specific objects/structures within the scene. Thus, objects, the locations where 
objects are detected, and the category of locations form a taxonomic hierarchy. There are 
several advantages to this type of approach. Contextual information helps disambiguate 
the identity of objects despite the poverty of scene detail in flight images and quality 
degradation due to video noise. Furthermore, exploiting visual contexts, we obviate the 
need for an exhaustive search of objects over various scales and locations in the image. 

For each aerial image, we first perform categorization, i.e., sky/ground image
segmentation. Then, we proceed with localization, i.e., recognition of global objects and
structures (e.g., road, forest, meadow) in the ground region. Finally, in the recognized
locations we search for objects of interest (e.g., cars, buildings). To account for different
flight scenarios, different sets of image classes can be defined accordingly. Using the
prior knowledge of a MAV’s whereabouts, we can reduce the number of image classes,
and, hence, computational complexity as well as classification error.

At each layer of the contextual taxonomy, downward, we conduct MSVC-based
object recognition. Here, we generalize the meaning of image classes to any
global-object appearance. Thus, the results from Sections 3 and 4 are readily applicable. In the
following example, shown in Figure 3, each element of the set of locations {road, forest,
lawn} induces subsets of objects, say, {car, cyclist} pertaining to road. Consequently,
when performing MSVC, we consider only a small finite number of image classes, which
improves recognition results. Thus, in spite of video noise and poverty of image detail,
the object in Figure 3, being tested against only two possibilities, is correctly recognized
as a car.
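The taxonomy can be represented as a nested mapping from category to location to candidate objects. The sketch below encodes only what the example states (road induces car and cyclist); the empty lists for forest and lawn are placeholders, since the text does not list objects for those locations, and the names `TAXONOMY` and `candidate_objects` are illustrative.

```python
# Category -> location -> object classes to test within that location.
TAXONOMY = {
    "sky": {},
    "ground": {
        "road": ["car", "cyclist"],
        "forest": [],     # placeholder: not specified in the text
        "lawn": [],       # placeholder: not specified in the text
    },
}

def candidate_objects(category, location, taxonomy=TAXONOMY):
    """Object classes worth testing once category and location are recognized."""
    return taxonomy.get(category, {}).get(location, [])

print(candidate_objects("ground", "road"))
```

Restricting the class set at each layer is what reduces the final decision in Figure 3 to a choice between two object classes.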

7 Results 

In this section, we demonstrate the performance of the proposed vision system for
real-time object recognition in supervised-learning settings. We carried out several sets of
experiments which we report below. For space reasons, we discuss only our results for
car and cyclist recognition in flight video.

Fig. 3. The hierarchy of visual contexts conditions gradual image interpretation: (a) a 128 × 128
flight image; (b) categorization: sky/ground classification; (c) localization: road recognition; (d)
car recognition.

For training TSBNs, we selected 200 flight images for the car and cyclist classes.
We carefully chose the training sets to account for the enormous variability in car and
cyclist appearances, as illustrated in Figure 4 (top row). After experimenting with
different image resolutions, we found that reliable labeling was achievable for resolutions
as coarse as 64 × 64 pixels. At this resolution, all the steps in object recognition (i.e.,
sky/ground classification, road localization and car/cyclist recognition), when the
feature set comprises all nine features, take approximately 0.1 s on an Athlon 2.4 GHz
PC. For the same set-up, but for only one optimal feature, recognition time is less
than 0.07 s,⁴ which is quite sufficient for the purposes of moving-car or moving-bicycle
tracking. Moreover, for a sequence of video images, the categorization and localization
steps could be performed only for images that occur at specified time intervals, although,
in our implementation, we process every image in a video sequence for increased noise
robustness.

After training our car and bicycle statistical models, we tested MSVC performance
on a set of 100 flight images. To support our claim that MSVC outperforms MSBC,
we carried out a comparative study of the two approaches on the same dataset. For
validation accuracy, we separated the test images into two categories. The first category
consists of 50 test images with easy-to-classify car/cyclist appearances as illustrated
in Figure 4a and Figure 4b. The second category includes another 50 images, where
multiple hindering factors (e.g. video noise and/or landscape and lighting variability, as
depicted in Figure 4c and Figure 4d) led to poor classification. Ground truth was
established through hand-labeling pixels belonging to objects for each test image. Then,
we ran the MSVC and MSBC algorithms, accounting for the image-dependent optimal
subset of features. Comparing the classification results with ground truth, we computed
the percentage of erroneously classified pixels for the MSVC and MSBC algorithms.
The results are summarized in Table 1, where we do not report the error of complete
misses (CM) (i.e., the error when an object was not detected at all) and the error of
swapped identities (SI) (i.e., the error when an object was detected but misinterpreted).
Also, in Table 2, we report the recognition results for 86 and 78 car/cyclist objects in
the first and second categories of images, respectively. In Figure 5, we illustrate the better
performance of MSVC over MSBC for a sample first-category image.

⁴ Note that even if only one set of wavelet coefficients is optimal, it is necessary to compute all
other sets of wavelets in order to compute the optimal one at all scales. Thus, in this case, time
savings are achieved only due to the reduced number of features for which MSVC is performed.

Fig. 4. Recognition of road objects: (top) Aerial flight images; (middle) localization: road
recognition; (bottom) object recognition. MSVC was performed for the following optimized sets of
features: (a) $\mathcal{F}_{sel} = \{H, I, -45^\circ\}$, (b) $\mathcal{F}_{sel} = \{H, \pm 75^\circ\}$, (c) $\mathcal{F}_{sel} = \{\pm 15^\circ, \pm 45^\circ\}$,
(d) $\mathcal{F}_{sel} = \{H, \pm 45^\circ\}$.
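The pixel-error measure used for the comparison can be sketched as a direct count of label disagreements against the hand-labeled ground truth. The function name and the tiny label maps below are illustrative only.

```python
def misclassified_percentage(predicted, ground_truth):
    """Percentage of pixels whose predicted label differs from the hand label.
    Both arguments are equally sized 2-D lists of class labels."""
    total = 0
    errors = 0
    for predicted_row, truth_row in zip(predicted, ground_truth):
        for p, t in zip(predicted_row, truth_row):
            total += 1
            errors += p != t          # True counts as 1
    return 100.0 * errors / total

predicted = [[0, 1], [1, 1]]
ground_truth = [[0, 1], [0, 1]]
print(misclassified_percentage(predicted, ground_truth))
```

Averaging this quantity over the images of each category would give figures of the kind reported per category; how the averaging was done is not stated in the text.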



Table 1. Percentage of misclassified pixels by MSVC and MSBC

          I category images    II category images
MSVC             4%                   10%
MSBC             9%                   17%



Finally, we illustrate the validity of our adaptive feature selection algorithm. In
Figure 6, we present MSVC results for different sets of features. Our adaptive feature
selection algorithm, for the given image, found $\mathcal{F}_{sel} = \{H, -45^\circ, \pm 75^\circ\}$ to be the
optimal feature subset. To validate the performance of the selection algorithm, we segmented
the same image using all possible subsets of the feature set $\mathcal{F}$. For space reasons, we
illustrate only some of these classification results. Obviously, from Figure 6, the selected
optimal features yield the best image labeling. Moreover, note that when all the features
were used in classification, we actually obtained worse results. In Table 3, we present
the percentage of erroneously classified pixels by MSVC using different subsets of
features for our two categories of 100 test images. As before, we do not report the error of
complete misses. Clearly, the best classification results were obtained for the optimal
set of features.



Table 2. Correct recognition (CR), complete miss (CM), and swapped identity (SI)

         I category images (86 objects)    II category images (78 objects)
           CR      CM      SI                CR      CM      SI
MSVC       81       1       4                69       5       4
MSBC       78       2       6                64       9       5
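The counts in Table 2 can be turned into recognition rates with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout and the function name `rates` are our own, and the only data used are the counts reported in the table. It also checks that CR, CM, and SI add up to the stated number of objects in each category.

```python
# Counts copied from Table 2; the data layout is an illustrative assumption.
TABLE2 = {
    ("MSVC", "I"): {"CR": 81, "CM": 1, "SI": 4},
    ("MSVC", "II"): {"CR": 69, "CM": 5, "SI": 4},
    ("MSBC", "I"): {"CR": 78, "CM": 2, "SI": 6},
    ("MSBC", "II"): {"CR": 64, "CM": 9, "SI": 5},
}
OBJECTS = {"I": 86, "II": 78}


def rates(counts, total):
    """Convert raw counts into percentages of the total number of objects."""
    if sum(counts.values()) != total:
        raise ValueError("CR + CM + SI must equal the number of objects")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Correct-recognition rate (in percent) per method and image category.
cr_rate = {key: rates(c, OBJECTS[key[1]])["CR"] for key, c in TABLE2.items()}

for (method, category), value in sorted(cr_rate.items()):
    print(f"{method} category {category}: CR = {value:.1f}%")
```

Under these counts, MSVC has the higher correct-recognition rate in both categories, which agrees with the comparison drawn in the text.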




Fig. 5. Better performance of MSVC vs. MSBC for the optimal feature set
$\mathcal{F}_{sel} = \{H, I, \pm 15^\circ, \pm 75^\circ\}$: (a) a first-category image; (b) MSVC; (c) MSBC.



Fig. 6. Validation of the feature selection algorithm for road recognition: (a) MSVC for the
optimized $\mathcal{F}_{sel} = \{H, -45^\circ, \pm 75^\circ\}$; (b) MSVC for all nine features in $\mathcal{F}$; (c) MSVC for subset
$\mathcal{F}_1 = \{H, S, I\}$; (d) MSVC for subset $\mathcal{F}_2 = \{\pm 15^\circ, \pm 45^\circ, \pm 75^\circ\}$.

Table 3. Percentage of misclassified pixels by MSVC

Feature set                                                           I category    II category
$\mathcal{F}_{sel}$                                                       4%            10%
$\mathcal{F} = \{H, S, I, \pm 15^\circ, \pm 45^\circ, \pm 75^\circ\}$    13%            17%
$\mathcal{F}_1 = \{H, S, I\}$                                            16%            19%
$\mathcal{F}_2 = \{\pm 15^\circ, \pm 45^\circ, \pm 75^\circ\}$           14%            17%

8 Conclusion

Modeling complex classes in natural-scene images requires an elaborate consideration
of class properties. The most important factors that informed our design choices for a
MAV vision system are: (1) real-time constraints, (2) robustness to video noise, and
(3) complexity of various object appearances in flight images. In this paper, we first
presented our choice of features: the HSI color space and the CWT. Then, we introduced
the TSBN model and the training steps for learning its parameters. Further, we described
how the learned parameters could be used for computing the likelihoods of all nodes at
all TSBN scales. Next, we proposed and demonstrated multiscale Viterbi classification
(MSVC) as an improvement to multiscale Bayesian classification (MSBC). We showed how
to globally optimize MSVC with respect to the feature set through an adaptive feature
selection algorithm. By determining an optimal feature subset, we reduced the
dimensionality of the feature space and thus not only approached the real-time
requirements for applications operating on real-time video streams, but also improved
overall classification performance. Finally, we discussed object recognition based on
visual contexts, where contextual information helps disambiguate the identity of objects
despite a poverty of scene detail and obviates the need for an exhaustive search of
objects over various scales and locations in the image. We organized test images into
two categories of difficulty and obtained excellent classification results, especially for
complex-scene/noisy images, thus validating the proposed approach.
Abstract. This paper proposes a method to integrate multiple linear-pushbroom
panoramic images. The integration can be performed in real time. The technique
applies to planar scenes, such as large-scale paintings or aerial/satellite images
that can be considered planar. The image integration consists of two steps:
stitching and Euclidean reconstruction. For image stitching, a minimum of five
pairs of non-collinear corresponding image points is required in the general case.
In special configurations where there is column-to-column image correspondence
between the two panoramas, the number of corresponding image points required
can be reduced to three. For Euclidean reconstruction, five pairs of non-collinear
corresponding image points on the image boundaries are sufficient.



1 Introduction 

Image mosaicing techniques were already applied in photogrammetry in the
1980s for constructing large aerial and satellite photographs [16]. However, it
was not until the 1990s that intensive research on the automatic construction of
panoramic image mosaics was carried out in the fields of computer vision and
computer graphics [20]. Two main benefits of image mosaicing are the ability to
increase the resolution and the ability to enlarge the field of view of a camera.

In computer vision, panoramic image mosaics often serve as representations
of visual scenes for a wide variety of applications [14, 21, 6, 12, 8]. When
multiple panoramic images are available, depth information or other geometric
properties of the 3D scene can be recovered [11, 18, 22]. In computer
graphics, panoramic image mosaics play an important role in the technique of 
image-based rendering [2, 15, 4, 13, 10, 19]. The key idea of this technique is 
to rapidly generate novel views from a set of existing images. Panoramic images 
are also used widely in virtual reality systems to provide an immersive and 
photorealistic environment [1]. 

The traditional way of constructing a panoramic image mosaic is to align a set
of matrix images of a common view by performing image transformations. When
images are acquired with unknown camera poses, the camera calibration problem
must be solved before the image transformations can take place. More recently,
the line-camera concept for creating panoramic image mosaics was introduced
[9, 5, 17], in which a sequence of slit (or line) images is used as the base
elements, instead of matrix images, for composing a panoramic image. Such a
panorama is generated by joining a sequence of line images side by side, and is
called a line-based panoramic image. The main advantage of using line images is
to ease, or even avoid, the camera extrinsic calibration problem, so that
panoramic mosaics can be generated during the image acquisition process itself.
One major trade-off of line-based panoramic images is that the vertical field of
view is constrained by the resolution of the line image. There is a lack of
research on stitching two line-based panoramic images vertically to increase the
panorama's field of view.

The linear-pushbroom camera model, which belongs to the line-based panorama
category, was first introduced by R. I. Hartley in 1997 [5]. The main
characteristic of linear-pushbroom panoramic images is that the line-camera moves
along a straight line during image acquisition. We investigated the possibility of
integrating two such panoramic images under some additional geometric
constraints. We found that two linear-pushbroom panoramas are geometrically
related by an affine transformation if they capture a common planar scene. In this
paper, the integration of two linear-pushbroom panoramic images of a planar scene
is established for the first time. We conclude that only a few corresponding image
points are needed to perform the integration. This integration technique can be
used for digitizing large-scale 2D artworks in museums or documenting huge
historical wall paintings. It can also be used on aerial or satellite images that
can be considered planar.

The paper is organized as follows. The linear-pushbroom camera model is
summarized in Section 2, in which the projection matrix and the LP-fundamental
matrix of this camera model are recapped. The image integration method is
reported in Section 3, in which the image transformation equations for image
stitching and Euclidean reconstruction are elaborated in turn. Section 4
illustrates the integration result for a famous large-scale Chinese painting.
Finally, conclusions are drawn in Section 5.



2 Review of Linear-Pushbroom Camera Model 

A linear-pushbroom camera can be considered as a perspective line-camera¹
moving in a linear orbit with a constant velocity and a fixed orientation. As the
line-camera moves, the view plane² sweeps out a region of space and 1-D images
are captured. Together, the 1-D images constitute a 2-D image which lies on a
plane in 3-D space called the image plane.

¹ An optical system projecting an image onto a 1-D array sensor, typically a CCD
array, is called a line-camera.

² The plane defined by the optical center and the sensor array is called a view plane.

An arbitrary point $\mathbf{x} = (x, y, z)^T$ in space is imaged and represented by two
coordinates u and v. It has been shown in [5] that the linear-pushbroom camera
model can be written as follows:

\[ (u, wv, w)^T = M (x, y, z, 1)^T \tag{1} \]

where w is a scale factor and M is a 3x4 projection matrix. The linear-pushbroom
camera model, $(u, wv, w)^T = M (x, y, z, 1)^T$, should be compared with the basic
pin-hole camera model. An obvious difference is that the matrix of the pin-hole
camera model is homogeneous, whereas the linear-pushbroom camera matrix is not.
That is, when the linear-pushbroom camera matrix M is multiplied by an arbitrary
factor k, the v coordinate is unchanged while the u coordinate is scaled by k.
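The non-homogeneity of M can be checked numerically. The following is a minimal sketch of equation 1, assuming a hypothetical matrix `M` and point `p` chosen only for illustration; the function name `project_pushbroom` is ours and does not come from [5].

```python
def project_pushbroom(M, point):
    """Project a 3-D point with a 3x4 linear-pushbroom matrix M.

    Evaluates (u, w*v, w)^T = M (x, y, z, 1)^T and returns (u, v).
    """
    x, y, z = point
    h = (x, y, z, 1.0)
    u, wv, w = (sum(m * c for m, c in zip(row, h)) for row in M)
    if w == 0:
        raise ValueError("w = 0: the point has no finite v coordinate")
    return u, wv / w


# Hypothetical projection matrix and scene point (illustrative values only).
M = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 3.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
p = (1.0, 2.0, 3.0)

u, v = project_pushbroom(M, p)

# Scale the whole matrix by k: u is scaled by k, v stays the same.
k = 5.0
M_scaled = [[k * m for m in row] for row in M]
u_scaled, v_scaled = project_pushbroom(M_scaled, p)
```

Because u is read directly from the first row while v is a ratio of the second and third rows, only v is invariant to the overall scale of M.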

Consider a point $\mathbf{x} = (x, y, z)^T$ in space viewed by two linear-pushbroom
cameras with projection matrices M and M'. Let $\mathbf{u} = (u, v)^T$ and $\mathbf{u}' = (u', v')^T$
be the images of point $\mathbf{x}$ in these two panoramas respectively. A cubic equation
$p(u, v, u', v') = 0$, called the fundamental polynomial corresponding to these two
cameras, is introduced in [5], where the coefficients of p are determined by the
entries of M and M'. It is shown in [5] that there exists a 4x4 matrix F such
that the equation $p(u, v, u', v') = 0$ may be rewritten as follows:

\[ (u', u'v', v', 1) \, F_{4 \times 4} \, (u, uv, v, 1)^T = 0. \]

The matrix F is called the LP-fundamental matrix corresponding to the
linear-pushbroom camera pair {M, M'}. The matrix expresses the relationships
between corresponding curves in these two linear-pushbroom panoramic images.

3 Integration of Linear-Pushbroom Panoramic Images 

Consider two linear-pushbroom panoramic images (or LP-mosaic images) of a
planar scene (such as a large painting on a wall). In this section, we propose an
image integration method to stitch these two panoramic images using image
correspondence information. The key question is whether, for an arbitrary point
in the first image, its corresponding point in the second image can be
determined, and vice versa. In general, only curve-to-curve relationships can be
established for two LP-mosaic images according to the theory of the
LP-fundamental matrix. Hence, the correspondence is ambiguous (up to a curve)
for a point specified in the first image and vice versa.

In the perspective case, a point-to-point relationship can be established by
imposing scene constraints, such as co-planarity. The so-called planar
homography [3] can be determined from four given pairs of image correspondences,
and a complete point-to-point relationship can be established exactly from the
planar homography thus determined. Planar homography has been widely adopted
for applications such as image mosaicing [1, 21] and panorama construction [20].

For the linear-pushbroom case, we are interested in whether point-to-point
relationships can also be determined when the scene is planar. If the




Stitching and Reconstruction of Linear-Pushbroom Panoramic Images 






answer is yes, how many image correspondence pairs are required? These issues 
will be addressed in the following. 

3.1 Image Stitching

Let $\mathbf{x}_i = (x_i, y_i, z_i)^T$ denote points in space that lie on a plane with planar
equation $E: a x_i + b y_i + c z_i + d = 0$ and are viewed by two linear-pushbroom
cameras. Let $\mathbf{u}_i = (u_i, v_i)^T$ and $\mathbf{u}'_i = (u'_i, v'_i)^T$ be the images of point $\mathbf{x}_i$ in
the source and the destination LP-mosaic images respectively. We intend to find
transformation equations that transform all the image points of the source
panorama to the destination panorama, based on a set of corresponding points
$\mathbf{u}_i$ and $\mathbf{u}'_i$.

According to the linear-pushbroom camera model discussed in the last section
(equation 1), we have

\[
\begin{aligned}
(u_i, w_i v_i, w_i)^T &= M (x_i, y_i, z_i, 1)^T, \\
(u'_i, w'_i v'_i, w'_i)^T &= M' (x_i, y_i, z_i, 1)^T,
\end{aligned}
\tag{2}
\]

where M and M' are the 3x4 projection matrices associated with the source and the
destination panoramic images, respectively. Let $m_{jk}$ and $m'_{jk}$, where $1 \le j \le 3$
and $1 \le k \le 4$, denote the elements of M and M' respectively. Equation 2 plus
the planar equation $E: a x_i + b y_i + c z_i + d = 0$ can be rearranged into the
following seven equations:

\[
\begin{cases}
u_i = m_{11} x_i + m_{12} y_i + m_{13} z_i + m_{14} & \text{(i)} \\
w_i v_i = m_{21} x_i + m_{22} y_i + m_{23} z_i + m_{24} & \text{(ii)} \\
w_i = m_{31} x_i + m_{32} y_i + m_{33} z_i + m_{34} & \text{(iii)} \\
u'_i = m'_{11} x_i + m'_{12} y_i + m'_{13} z_i + m'_{14} & \text{(iv)} \\
w'_i v'_i = m'_{21} x_i + m'_{22} y_i + m'_{23} z_i + m'_{24} & \text{(v)} \\
w'_i = m'_{31} x_i + m'_{32} y_i + m'_{33} z_i + m'_{34} & \text{(vi)} \\
a x_i + b y_i + c z_i + d = 0 & \text{(vii)}
\end{cases}
\tag{3}
\]



Because $w_i$ and $w'_i$ are not necessarily the same for each corresponding
pair $\mathbf{u}_i$ and $\mathbf{u}'_i$, we may deal with these two variables separately. First, for a
given $\mathbf{u}_i$, we use equations (i), (ii), (iii), (iv), and (vii) in equation 3 to find its
corresponding $u'_i$ while avoiding the influence of $w'_i$, and the following equation holds:

\[
\begin{bmatrix}
m_{11} & m_{12} & m_{13} & m_{14} - u_i & 0 \\
m_{21} & m_{22} & m_{23} & m_{24} & -v_i \\
m_{31} & m_{32} & m_{33} & m_{34} & -1 \\
m'_{11} & m'_{12} & m'_{13} & m'_{14} - u'_i & 0 \\
a & b & c & d & 0
\end{bmatrix}
\begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \\ w_i \end{bmatrix}
= \mathbf{0},
\tag{4}
\]

where the left 5x5 matrix is denoted as $W_1$. This is a set of five homogeneous
equations in five unknowns. Because one of the five unknowns is 1, this system
has a non-zero solution. Note that equation 4 can have a non-zero solution only
if $\det(W_1) = 0$. Because the determinant of $W_1$ consists of a constant term and
terms in $u_i$, $v_i$, $u'_i$, $u_i v_i$, and $u'_i v_i$, the following equation with six coefficients,
$a_0 \sim a_5$, exists:

\[ a_0 + a_1 u_i + a_2 v_i + a_3 u'_i + a_4 u_i v_i + a_5 u'_i v_i = 0. \tag{5} \]



Similarly, for a given $u'_i$, we use equations (i), (iv), (v), (vi), and (vii) in
equation 3 to find its corresponding $u_i$ while avoiding the influence of $w_i$. We again
obtain a set of five homogeneous equations in five unknowns. By the same argument
as above, we may conclude that the following equation with six coefficients,
$b_0 \sim b_5$, exists:

\[ b_0 + b_1 u_i + b_2 u'_i + b_3 v'_i + b_4 u_i v'_i + b_5 u'_i v'_i = 0. \]

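Suppose the coefficients $a_i$ and $b_i$ are known. Given a point $\mathbf{u}_i = (u_i, v_i)^T$, its
corresponding point $\mathbf{u}'_i = (u'_i, v'_i)^T$ can be calculated by the following equations:

\[
\begin{cases}
u'_i = \dfrac{-(a_0 + a_1 u_i + a_2 v_i + a_4 u_i v_i)}{a_3 + a_5 v_i} \\[2ex]
v'_i = \dfrac{-(b_0 + b_1 u_i + b_2 u'_i)}{b_3 + b_4 u_i + b_5 u'_i}
\end{cases}
\tag{6}
\]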
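Equation 6 translates directly into a small point-mapping routine. The sketch below assumes the twelve coefficients are already known; the function name `stitch_point` and the numeric coefficient values are hypothetical and serve only to exercise the formulas.

```python
def stitch_point(a, b, u, v):
    """Map a source point (u, v) to the destination point (u', v').

    a = (a0..a5) and b = (b0..b5) are the coefficients of the two bilinear
    constraints; each set is defined only up to a common scale factor.
    """
    den_u = a[3] + a[5] * v
    if den_u == 0:
        raise ZeroDivisionError("a3 + a5*v vanishes for this point")
    u_dst = -(a[0] + a[1] * u + a[2] * v + a[4] * u * v) / den_u

    den_v = b[3] + b[4] * u + b[5] * u_dst
    if den_v == 0:
        raise ZeroDivisionError("b3 + b4*u + b5*u' vanishes for this point")
    v_dst = -(b[0] + b[1] * u + b[2] * u_dst) / den_v
    return u_dst, v_dst


# Hypothetical coefficients (illustrative values, not from the paper).
a = (-1.0, -2.0, -3.0, 1.0, 0.0, 0.0)
b = (-4.0, -1.0, -1.0, 2.0, 0.0, 0.0)

mapped = stitch_point(a, b, 2.0, 1.0)

# Rescaling a coefficient set leaves the mapping unchanged.
a_scaled = tuple(3.0 * c for c in a)
mapped_scaled = stitch_point(a_scaled, b, 2.0, 1.0)
```

Note that the second formula uses the value of u' computed by the first, so the two coordinates must be evaluated in this order.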



These equations can be applied to transform all the image points of one
panorama to the other panorama. After the transformation, we obtain a single
integrated panoramic image. Therefore, the remaining problem is to determine the
values of the twelve coefficients, $a_0 \sim a_5$ and $b_0 \sim b_5$, from the provided
corresponding image points.

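Suppose we are given n pairs of corresponding points, namely $\{u_i, v_i, u'_i, v'_i\}$,
$i \in [1..n]$, where not all image points $(u_i, v_i)^T$ lie on the same image row or
image column. Equation 5 can then be restated as follows:

\[ a_0 \mathbf{1} + a_1 \mathbf{U} + a_2 \mathbf{V} + a_3 \mathbf{U}' + a_4 \mathbf{W} + a_5 \mathbf{W}' = \mathbf{0}, \tag{7} \]

where $\mathbf{1}$, $\mathbf{U}$, $\mathbf{V}$, $\mathbf{U}'$, $\mathbf{W}$, $\mathbf{W}'$ and $\mathbf{0}$ are all n-vectors. Vector $\mathbf{1}$ has all
its elements equal to one and vector $\mathbf{0}$ has all elements equal to zero; matching
equation 5 term by term, the i-th elements of $\mathbf{W}$ and $\mathbf{W}'$ are the products $u_i v_i$
and $u'_i v_i$. In order to have a non-trivial solution for $a_0 \sim a_5$, the rank of the
matrix $[\mathbf{1}, \mathbf{U}, \mathbf{V}, \mathbf{U}', \mathbf{W}, \mathbf{W}']$ must equal five. Moreover, if $\{a_0, a_1, a_2, a_3, a_4, a_5\}$
is a solution of equation 7, then $\{k a_0, k a_1, k a_2, k a_3, k a_4, k a_5\}$ is also a
solution for all $k \in \mathbb{R}$. However, all non-zero multiples lead to the same value
of $u'_i$ in equation 6. Hence, we aim to find any set of $a_0 \sim a_5$ that satisfies
equation 7.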

Since there are six unknowns and the six-dimensional solution vector is defined
up to a common scale factor, equation 7 can be solved with at least five pairs of
image correspondences. In our work, we assume $a_0^2 + a_1^2 + \ldots + a_5^2 = 1$. When
n > 5, a least-squares solution can be obtained by solving the eigenvalue
problem associated with the scatter matrix of the linear equation system.
Similar arguments also apply to $b_0 \sim b_5$.
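For the minimal case of exactly five correspondences, the unit-norm solution of equation 7 can be sketched without an eigen-solver: the null vector of a 5x6 matrix of rank five is given by its signed 5x5 minors. The code below is an illustrative sketch under that assumption, not the least-squares eigenvalue procedure used for n > 5; all function names and the test coefficients are hypothetical.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n = len(m)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[p][c] == 0:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d


def design_row(u, v, u_dst):
    """One row of [1, U, V, U', W, W'] with W = u*v and W' = u'*v."""
    return [1.0, u, v, u_dst, u * v, u_dst * v]


def solve_coefficients(corr):
    """Recover (a0..a5), scaled to unit norm, from five (u, v, u') triples."""
    if len(corr) != 5:
        raise ValueError("this sketch handles exactly five correspondences")
    rows = [design_row(*c) for c in corr]
    # Null vector of a rank-5 5x6 matrix: signed minors with column j removed.
    a = [(-1) ** j * det([r[:j] + r[j + 1:] for r in rows]) for j in range(6)]
    norm = sum(x * x for x in a) ** 0.5
    if norm == 0.0:
        raise ValueError("singular configuration: rank below five")
    return [x / norm for x in a]


# Synthetic check with hypothetical ground-truth coefficients.
a_true = [-1.0, -2.0, -3.0, 1.0, 0.0, 0.5]


def u_dst_from(a, u, v):
    """First formula of equation 6."""
    return -(a[0] + a[1] * u + a[2] * v + a[4] * u * v) / (a[3] + a[5] * v)


samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
corr = [(u, v, u_dst_from(a_true, u, v)) for u, v in samples]
a_est = solve_coefficients(corr)
```

The recovered vector agrees with the ground truth only up to scale and sign, which is all that equation 6 requires.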

A singular case occurs when the rank of the matrix $[\mathbf{U}, \mathbf{V}, \mathbf{U}', \mathbf{W}, \mathbf{W}']$ is less
than five. A common situation that causes the singular case is when vectors
$\mathbf{U}$ and $\mathbf{U}'$ are linearly dependent, that is, when we have $\mathbf{U} = A\mathbf{U}' + B$ for some
$A, B \in \mathbb{R}$. This situation happens when the two line sensors used for grabbing
the two LP-mosaic images are parallel to each other. (Note: we explain this
situation in the Appendix.) We disregard the case in which only a few corresponding
image points are provided, so that two or more of the vectors $\mathbf{U}$, $\mathbf{V}$, $\mathbf{U}'$, $\mathbf{W}$, and $\mathbf{W}'$
happen to be linearly dependent because of poor sampling.

When vectors $\mathbf{U}$ and $\mathbf{U}'$ are linearly dependent, instead of using equation 6,
we derive another set of transformation equations to transform the image. First,
since we know $\mathbf{U} = A\mathbf{U}' + B$, the values of A and B can be obtained directly
by solving a system of linear equations with at least two pairs of image
correspondences. Second, by substituting equations (i) and (iv) in equation 3
into $u_i = A u'_i + B$, we get

\[
\begin{cases}
(m_{11} - A m'_{11}) x_i + (m_{12} - A m'_{12}) y_i + (m_{13} - A m'_{13}) z_i + (m_{14} - A m'_{14} - B) = 0 \\
m_{21} x_i + m_{22} y_i + m_{23} z_i - v_i w_i + m_{24} = 0 \\
m_{31} x_i + m_{32} y_i + m_{33} z_i - w_i + m_{34} = 0 \\
m'_{21} x_i + m'_{22} y_i + m'_{23} z_i - v'_i w'_i + m'_{24} = 0 \\
m'_{31} x_i + m'_{32} y_i + m'_{33} z_i - w'_i + m'_{34} = 0 \\
a x_i + b y_i + c z_i + d = 0.
\end{cases}
\]

It implies

\[
\begin{bmatrix}
m_{11} - A m'_{11} & m_{12} - A m'_{12} & m_{13} - A m'_{13} & m_{14} - A m'_{14} - B & 0 & 0 \\
m_{21} & m_{22} & m_{23} & m_{24} & -v_i & 0 \\
m_{31} & m_{32} & m_{33} & m_{34} & -1 & 0 \\
m'_{21} & m'_{22} & m'_{23} & m'_{24} & 0 & -v'_i \\
m'_{31} & m'_{32} & m'_{33} & m'_{34} & 0 & -1 \\
a & b & c & d & 0 & 0
\end{bmatrix}
\begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \\ w_i \\ w'_i \end{bmatrix}
= \mathbf{0},
\]



where the left 6x6 matrix is denoted as $W_2$. It has a non-trivial solution if
$\det(W_2) = 0$, which means there exist coefficients $c_0 \sim c_3$ such that the following
equation holds:

\[ c_0 v_i + c_1 v'_i + c_2 v_i v'_i + c_3 = 0. \]

Given at least three pairs of corresponding image points, we are able to determine
solutions for $c_0 \sim c_3$. The resulting transformation equations in this case are as
follows:

\[
\begin{cases}
u'_i = \dfrac{u_i - B}{A} \\[2ex]
v'_i = \dfrac{-c_0 v_i - c_3}{c_1 + c_2 v_i}
\end{cases}
\]
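The column-to-column case can be sketched in the same way as the general one. In the code below, A and B are obtained from two correspondences of the first coordinate, and the mapping then follows from $u_i = A u'_i + B$ and the bilinear constraint in $v_i$ and $v'_i$. The function names and all numeric values are hypothetical and only illustrate the formulas.

```python
def solve_a_b(pair1, pair2):
    """Solve u = A*u' + B exactly from two (u, u') correspondences."""
    (u1, up1), (u2, up2) = pair1, pair2
    if up1 == up2:
        raise ValueError("the two u' values must differ")
    A = (u1 - u2) / (up1 - up2)
    B = u1 - A * up1
    return A, B


def stitch_point_parallel(A, B, c, u, v):
    """Map (u, v) to (u', v') when the two line sensors are parallel.

    c = (c0, c1, c2, c3) satisfies c0*v + c1*v' + c2*v*v' + c3 = 0.
    """
    if A == 0:
        raise ZeroDivisionError("A must be non-zero")
    den = c[1] + c[2] * v
    if den == 0:
        raise ZeroDivisionError("c1 + c2*v vanishes for this point")
    return (u - B) / A, -(c[0] * v + c[3]) / den


# Hypothetical correspondences (u, u') and coefficients, for illustration.
A, B = solve_a_b((5.0, 1.0), (9.0, 3.0))
c = (-2.0, 1.0, 0.0, -1.0)
mapped = stitch_point_parallel(A, B, c, 7.0, 4.0)
```

With more than two correspondences, A and B would normally be estimated in the least-squares sense rather than from a single pair of equations.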



3.2 Euclidean Reconstruction 

Through the stitching method introduced above, two LP-mosaic images can be
integrated into a single one. In our work, one of the LP-mosaic images is selected
as the reference image, and the other image is transformed (or warped) so that
the two images can be registered. By doing so, the integrated panoramic image
can be treated as being obtained by enlarging the viewing range of the line sensor
used for grabbing the reference image and then employing this range-enlarged
sensor to scan the same planar scene. However, such a scanning process cannot
preserve some desired properties such as right angles and parallelism.

From the reconstruction point of view, only an affine reconstruction can be 
obtained by using only image correspondence information as introduced in [5]. 
Without imposing other constraints on the scene or the camera, Euclidean recon- 
struction is impossible. Availability of constraints is application dependent. In 




C.-S. Chen, Y.-T. Chen, and F. Huang



the following, we focus on the application of reconstructing a large-scale
painting. The a priori knowledge that the painting is rectangular provides
scene constraints for Euclidean reconstruction.

To upgrade the original reconstruction to a Euclidean one, let us imagine a
virtual line sensor that scans the painting in such a way that the sensor is
aligned with the painting plane and is moved on this plane (that is, the painting
is scanned by a 'virtual' Xerox machine). Furthermore, we assume that the line
sensor is placed parallel to one of the four borderlines of the painting, and that
the moving path is perpendicular to the line sensor. Then, the painting scanned
with this virtual camera is rectangular as well, and we call this rectangular
frame the destination image. We aim to reconstruct the rectangular frame by
transforming the integrated panoramic image to the destination image. However,
there is no image pixel information for the virtual rectangular frame. Thus, the
only correspondence knowledge that can be applied is the boundaries of the frame
and of the image. Assuming that the ratio of the width to the height of the
rectangular frame is known, we will show that a Euclidean reconstruction can be
achieved by employing the four pairs of borderline-to-borderline correspondences
of the painting.

Let $(u_i, v_i)$ be the image coordinates in the source panoramic image (the
integrated mosaic), and $(u'_i, v'_i)$ be the image coordinates in the destination
rectangular frame. Based on the linear-pushbroom camera model discussed in Section
2, we have $(u'_i, w'_i v'_i, w'_i)^T = M'(x_i, y_i, z_i, 1)^T$. Since the destination image is
obtained by a 'virtual' Xerox machine, $w'_i$ becomes constant for all i, and its
value can be absorbed into the second and the third rows of matrix M'. Hence, we
have

\[
\begin{aligned}
(u_i, w_i v_i, w_i)^T &= M (x_i, y_i, z_i, 1)^T \\
(u'_i, v'_i, 1)^T &= M' (x_i, y_i, z_i, 1)^T
\end{aligned}
\]

and these two equations can be expanded as follows:

\[
\begin{cases}
u_i = m_{11} x_i + m_{12} y_i + m_{13} z_i + m_{14} & \text{(i)} \\
w_i v_i = m_{21} x_i + m_{22} y_i + m_{23} z_i + m_{24} & \text{(ii)} \\
w_i = m_{31} x_i + m_{32} y_i + m_{33} z_i + m_{34} & \text{(iii)} \\
u'_i = m'_{11} x_i + m'_{12} y_i + m'_{13} z_i + m'_{14} & \text{(iv)} \\
v'_i = m'_{21} x_i + m'_{22} y_i + m'_{23} z_i + m'_{24} & \text{(v)} \\
1 = m'_{31} x_i + m'_{32} y_i + m'_{33} z_i + m'_{34} & \text{(vi)}
\end{cases}
\tag{8}
\]

From (i), (iv), (v) and (vi) in equation 8, we obtain the following:

\[
\begin{bmatrix}
m_{11} & m_{12} & m_{13} & m_{14} - u_i \\
m'_{11} & m'_{12} & m'_{13} & m'_{14} - u'_i \\
m'_{21} & m'_{22} & m'_{23} & m'_{24} - v'_i \\
m'_{31} & m'_{32} & m'_{33} & m'_{34} - 1
\end{bmatrix}
\begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix}
= \mathbf{0}.
\]

The determinant of the left 4x4 matrix must equal zero; hence there exist
coefficients $d_0 \sim d_3$ such that the following equation holds:

\[ d_0 u_i + d_1 u'_i + d_2 v'_i + d_3 = 0. \tag{9} \]

Moreover, from (ii), (iii), (iv), (v) and (vi) in equation 8, we have

\[
\begin{bmatrix}
m_{21} & m_{22} & m_{23} & m_{24} & -v_i \\
m_{31} & m_{32} & m_{33} & m_{34} & -1 \\
m'_{11} & m'_{12} & m'_{13} & m'_{14} - u'_i & 0 \\
m'_{21} & m'_{22} & m'_{23} & m'_{24} - v'_i & 0 \\
m'_{31} & m'_{32} & m'_{33} & m'_{34} - 1 & 0
\end{bmatrix}
\begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \\ w_i \end{bmatrix}
= \mathbf{0},
\]

and for the same reason as above, we obtain the following equation:

\[ e_0 v_i + e_1 u'_i + e_2 v'_i + e_3 v_i u'_i + e_4 v_i v'_i + e_5 = 0. \tag{10} \]

Let the four corner points of the virtual rectangular frame be (0, 0), (W, 0),
(0, H), and (W, H), respectively, where W and H are the width and height
of the frame. Using these corner points as four inputs $(u'_i, v'_i)$ together with the
corresponding corners of the integrated panoramic image $(u_i, v_i)$, we are able to
determine the values of $d_0 \sim d_3$ in equation 9.

Consider a boundary point $(u_i, v_i)^T$ lying on one of the border lines of the
painting in the integrated panoramic image. One of its corresponding values
$u'_i$ and $v'_i$ is known (for example, equal to W or H), because the painting's
borderline-to-borderline correspondences are available. The unknown one can be
obtained from equation 9 using the determined values of $d_0 \sim d_3$. Thus, given
at least five image correspondences on the boundaries, we are able to determine
the values of $e_0 \sim e_5$ in equation 10. Once all the values of the coefficients are
known, the transformation equations can be derived. This transformation enables
us to refine the integrated panoramic image to a Euclidean reconstruction.
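One way to obtain the transformation from the coefficients is to note that, for a fixed source point $(u_i, v_i)$, equations 9 and 10 are both linear in the unknowns $u'_i$ and $v'_i$, so they form a 2x2 linear system. The sketch below solves that system with Cramer's rule. This is our own reading of "the transformation equations can be derived", offered as an assumption rather than the authors' stated formulas; the function name and coefficient values are hypothetical.

```python
def rectify_point(d, e, u, v):
    """Solve equations 9 and 10 for (u', v') given a source point (u, v).

    d = (d0..d3):  d0*u + d1*u' + d2*v' + d3 = 0
    e = (e0..e5):  e0*v + e1*u' + e2*v' + e3*v*u' + e4*v*v' + e5 = 0
    """
    # Rewrite both equations as  p*u' + q*v' = r.
    p1, q1, r1 = d[1], d[2], -(d[0] * u + d[3])
    p2, q2, r2 = e[1] + e[3] * v, e[2] + e[4] * v, -(e[0] * v + e[5])

    det = p1 * q2 - q1 * p2
    if det == 0:
        raise ZeroDivisionError("equations 9 and 10 are dependent at this point")
    u_dst = (r1 * q2 - q1 * r2) / det
    v_dst = (p1 * r2 - p2 * r1) / det
    return u_dst, v_dst


# Hypothetical coefficients: equation 9 reduces to u' = u and
# equation 10 to v' = 2v - u' (illustrative values only).
d = (-1.0, 1.0, 0.0, 0.0)
e = (-2.0, 1.0, 1.0, 0.0, 0.0, 0.0)
rectified = rectify_point(d, e, 3.0, 4.0)
```

As with the stitching coefficients, scaling d or e by a non-zero constant does not change the result.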

4 Experimental Results 

We conducted a synthetic experiment to demonstrate how the image correspondence
error (i.e., the input noise) affects the image stitching result. The experiment
was designed as follows. There are 250 coplanar points randomly distributed
in a bounded space. For each trial, two linear-pushbroom panoramic images
with an image resolution of 170 x 550 are captured by two virtual line-cameras,
whose intrinsic parameters are identical and held constant during the image
acquisition. The starting positions and the moving velocities of these two
cameras vary in each trial. The values of the position and the velocity vectors
are randomly chosen within practical ranges. The image correspondence error
is introduced by corrupting the ideal image projections with random noise of
up to two image pixels. The image stitching error is measured as the average
square-norm distance of all pairs of corresponding image points after merging.
The average stitching error over 1000 trials is calculated for each noise scale.
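The noise injection and the error measure can be sketched as follows. Two points are assumptions on our part, since the text does not spell them out: the noise is drawn uniformly and independently per coordinate, and the "square-norm distance" is taken to be the Euclidean distance between paired points. The function names are hypothetical.

```python
import math
import random


def add_noise(points, scale, rng):
    """Perturb each coordinate by a uniform offset in [-scale, scale] pixels."""
    return [(u + rng.uniform(-scale, scale), v + rng.uniform(-scale, scale))
            for u, v in points]


def mean_stitch_error(merged_a, merged_b):
    """Average Euclidean distance between paired points after merging."""
    if len(merged_a) != len(merged_b) or not merged_a:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.hypot(ua - ub, va - vb)
                for (ua, va), (ub, vb) in zip(merged_a, merged_b))
    return total / len(merged_a)


# Small illustration with hypothetical points.
rng = random.Random(0)
ideal = [(0.0, 0.0), (3.0, 4.0)]
noisy = add_noise(ideal, 2.0, rng)
err = mean_stitch_error(ideal, [(0.0, 0.0), (0.0, 0.0)])
```

In an experiment of the kind described above, the noisy correspondences would be used to estimate the stitching coefficients, and the error would then be averaged over all trials at each noise scale.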

Figure 1 illustrates our synthetic experiment. The labeled corresponding image
points are shown in the two linear-pushbroom panoramic images as well as in the
resulting image after the stitching process. Table 1 summarizes our experimental
results, which suggest that the stitching error grows roughly linearly as the
input noise increases from zero to two pixels.

Fig. 1. Images A and B represent two linear-pushbroom panoramic images. Only 27
corresponding image points (instead of 250) are shown here for clarity. The bottom
figures illustrate the stitching results of two cases: noise-free input (left) and noise of
up to two pixels (right).

Table 1. The image correspondence errors (at different noise scales) vs. the image
stitching errors (the average square norm).

Noise (pixel)    0      0.25    0.5     0.75    1       1.25    1.5     1.75    2
Error (pixel)    0.05   0.69    0.82    1.37    1.77    2.34    2.49    2.66    3.48



Moreover, a real image example is given in figure 2. The painting is named
"Lang Shih-Ling One Hundred Stallions". A Sony DCR-VX2000 camera was used,
and only the central image column of each shot was employed for generating the
panoramic image. Two linear-pushbroom panoramic images of a small portion of the
painting were acquired with some overlap; they are shown in figure 2 (A)
and (B). The resolution of each of these two panoramas is 450 x 2000 pixels.
Figure 2 (C) shows the image stitching result based on 53 identified corresponding
image points. The resulting image after Euclidean reconstruction is shown
in figure 2 (D), which has a resolution of 600 x 1700 pixels. This width/height
ratio has been adjusted to match the true ratio of the selected portion.

5 Conclusion 

Planar homography, which can help determine complete point-to-point image
relationships for a pair of images taken with perspective cameras, serves as a
critical property for realizing many image mosaicing applications. Nevertheless,
whether similarly useful properties exist for other imaging models, such as the
linear-pushbroom model, has not yet been well studied. In this paper, we
demonstrate that, with an additional planarity constraint on the scene geometry,
complete point-to-point image relationships can also be established between two
linear-pushbroom panoramic images by employing at least five pairs of
corresponding points. Based on this property, an image stitching method is
developed for integrating two LP-panoramas to enlarge the panorama's field of view.
Moreover, a Euclidean reconstruction method is presented to restore the
properties of 2D Euclidean geometry for reconstructing a rectangular frame. Both
methods require only a few pairs of corresponding image points as input. The
image integration algorithm as a whole can be performed in real time.

Fig. 2. Portion of painting "Lang Shih-Ling One Hundred Stallions".

Appendix

Let E and E' be two LP-mosaic images of a common planar scene. We explain and
illustrate that, when the two line sensors used to grab the two LP-mosaic images are
parallel to each other, the following statement holds: for any pair of corresponding
image points $(u_i, v_i)^T$ and $(u'_i, v'_i)^T$, the values $u_i$ and $u'_i$ are related by the equation
$u_i = A u'_i + B$, where A and B are constants.

First, consider a plane in 3D and a line camera which moves along a straight line
(taken to be the x-axis of the camera coordinate system) with constant velocity, taking
line images at positions $C_0$, $C_1$, $C_2$, and so on, as shown in the top-left of figure 3.
The y-axis is defined to be parallel to the line sensor and perpendicular to the
x-axis. The z-axis is defined by the right-hand rule. The geometric relationship
between the plane and the camera coordinate system is unknown.

The bottom-left of figure 3 shows the resulting LP-panoramic image E. The parallel
lines $L_0 \sim L_4$ on the plane are projected to image columns u = 0 ~ 4 respectively.
Since the distance between any pair of positions $C_i$ and $C_{i+1}$ is constant (as defined in
Section 2), the lines $L_i$ and $L_{i+1}$ form a set of equally spaced parallel lines.

Then, consider another LP-panoramic image E', whose associated camera's moving
path is rotated about the y-axis, as shown on the right-hand side of figure 3.
The equally spaced parallel lines $L'_0 \sim L'_4$ on the plane are projected to image columns
u' = 0 ~ 4 respectively.

In fact, lines $L_0 \sim L_4$ are parallel to lines $L'_0 \sim L'_4$, as both sets are parallel to the
y-axis of the camera coordinate system. So, it is possible that lines $L_0 \sim L_4$ also appear in
the image E' and vice versa. Hence, we have column-to-column correspondence between
the two LP-panoramic images. By basic geometry, when there is a column-to-column
correspondence between two images as described, the relationship between the
corresponding columns can be expressed by the equation $u_i = A u'_i + B$,
where A and B are constants.

Fig. 3. Geometric configuration that illustrates the existence of image column-to-
column correspondence.

Finally, we may conclude that as long as the y-axes of the two camera coordinate
systems, which are associated with the different LP-panoramic images, are parallel, we
have the relation $u_i = A u'_i + B$ for all corresponding $u_i$ and $u'_i$.
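Whether a given set of correspondences exhibits this column-to-column relation can be checked by fitting a line to the pairs of first coordinates and inspecting the residual. The sketch below uses an ordinary least-squares fit; the function name, the sample values, and the idea of using the maximum residual as a test are illustrative assumptions.

```python
def fit_column_map(us, us_dst):
    """Least-squares fit of u = A*u' + B; returns (A, B, max_abs_residual)."""
    n = len(us)
    if n < 2 or n != len(us_dst):
        raise ValueError("need at least two paired values")
    mean_u = sum(us) / n
    mean_d = sum(us_dst) / n
    sxx = sum((p - mean_d) ** 2 for p in us_dst)
    if sxx == 0:
        raise ValueError("all u' values are identical")
    sxy = sum((p - mean_d) * (u - mean_u) for p, u in zip(us_dst, us))
    A = sxy / sxx
    B = mean_u - A * mean_d
    resid = max(abs(u - (A * p + B)) for p, u in zip(us_dst, us))
    return A, B, resid


# Hypothetical column coordinates that satisfy u = 2*u' + 3 exactly.
A, B, resid = fit_column_map([3.0, 5.0, 7.0, 9.0], [0.0, 1.0, 2.0, 3.0])
```

A residual close to zero (relative to the expected correspondence noise) indicates that the line sensors were approximately parallel and that the singular-case equations of Section 3.1 apply.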
Abstract. This paper introduces a new concept in surveillance, namely,
audio-visual data integration for background modelling. Visual data acquired
by a fixed camera can easily be supplemented by audio information, allowing a
more complete analysis of the monitored scene. The key idea is to build a
multimodal model of the scene background, able to promptly detect single
auditory or visual events, as well as simultaneous audio and visual foreground
situations. In this way, it is also possible to tackle some open problems of
standard visual surveillance systems (e.g., the sleeping foreground problem),
provided they are also characterized by an audio foreground. The method is based
on the probabilistic modelling of the audio and video data streams using separate
sets of adaptive Gaussian mixture models, and on their integration using a coupled
audio-video adaptive model working on the frame histogram and the audio
frequency spectrum. This framework has been shown to be able to evaluate
the time causality between visual and audio foreground entities. To the
best of our knowledge, this is the first attempt at multimodal modelling of
scenes that works on-line and uses one static camera and only one
microphone. Preliminary results show the effectiveness of the approach
in facing problems still unsolved by purely visual monitoring approaches.



1 Introduction 

Automated surveillance systems have acquired increased importance in recent
years, due to their utility in the protection of critical infrastructures and
civil areas. This trend has amplified the interest of the scientific community
in the field of video sequence analysis and, more generally, in the pattern
recognition area [1]. In this context, the most important low-level analysis is
the so-called background modelling [2,3], aimed at discriminating the static
scene, namely, the background (BG), from the objects that are acting in the
scene, i.e., the foreground (FG). Despite the large related literature, there are
many problems that are still open [3], such as the sleeping foreground
problem. In general, almost all of the methods work only at the visual level,
hence resulting in video BG modelling schemes. This could be a severe limitation,
since other information modalities are easily available (e.g., audio), which could
be effectively used as complementary information to discover "activity patterns"
in a scene.

In this paper, the concept of multimodal, specifically audio-video, BG
modelling is introduced, which aims at integrating different kinds of sensorial
information in order to realize a more complete BG model. In the literature, the
integration of audio and visual cues has received growing attention in the last
few years. In general, audio-visual information has been used in the context of
speech recognition and, recently, of scene analysis, especially person tracking. A
critical review of the literature devoted to audio-video scene analysis is reported
in section 2.

In order to integrate audio and visual information, different adaptive BG 
mixture models are first designed for monitoring the segregated sensorial data 
streams. The model for visual data operates at two levels. The first is a typical 
time-adaptive per-pixel mixture of Gaussians model [2], able to identify the 
FG present in a scene. The second model works on the FG histogram, and is 
able to classify different FG events. Concerning the audio processing scheme, 
the concept of audio BG modelling is introduced, proposing a system able to 
detect unexpected audio events. In short, a multiband frequency analysis is
first carried out to characterize the monaural audio signal, by extracting features
from a parametric estimation of the power spectral density. The audio BG model 
is then obtained by modelling these features using a set of adaptive mixtures of 
Gaussians, one for each frequency subband. 

Concerning the on-line fusion of audio information with visual data, the most 
basic issue to be addressed is the concept of “synchrony”, which derives from 
psycho-physiological research [4,5]. In this work, we consider that visual and
audio FG that appear “simultaneously” are synchronous, i.e., likely causally
correlated. The correlation increases if both FG events persist over time.

Therefore, a third module based on adaptive mixture models operating on
audio-visual data has been devised. This module operates in a hybrid space
composed of the audio frequency bands and the FG histogram bins, allowing
the binding of concomitant visual and audio events, which can be labelled as
belonging to the same multimodal FG event. In this way, a globally consistent
multilevel probabilistic framework is developed, in which the segregated adaptive
modules control the different sensorial audio and video streams separately, and
the coupled audio-video module monitors the multimodal scenario to detect
concurrent events. The three modules interact with each other to allow a more
robust and reliable FG detection.

In practice, our structure of BG modelling is able to face serious issues of 
standard BG modelling schemes, e.g., the sleeping FG problem [3]. 

The general idea is that an audio-visual pattern can remain an actual FG 
even if one of the components (audio or video) is missing. The crucial step is 
therefore the discovery of the audio-visual pattern in the scene. 

In summary, the paper introduces several concepts related to multimodal
scene analysis, discussing the problems involved and showing the potential and
possible future directions of the research. The key contributions of this work
are: 1) the definition of the novel concept of multimodal background model,
introducing, together with video data, audio information processing that
performs an auditory scene analysis using only one microphone; 2) a method for
integrating audio and video information in order to discover synchronous
audio-visual patterns on-line; 3) the implementation of these audio-visual
fusion principles in a probabilistic framework working on-line and able to deal
with complex issues in video surveillance, e.g., the sleeping foreground
problem.

The rest of the paper is organized as follows. In Section 2, the state of the art
related to audio-video fusion for scene analysis is presented. The whole
strategy is proposed in Section 3, and preliminary experimental results are
reported in Section 4. Finally, in Section 5, conclusions are drawn.



2 State of the Art of the Audio-Visual Analysis 

In the context of audio-visual data fusion it is possible to identify two main
research fields: on-line audio-visual association for tracking tasks, and
the more generic off-line audio-visual association, in which the concept of
audio-visual synchrony is particularly stressed.

In the former, the typical scenario is an indoor known environment with 
moving or static objects that produce sounds, monitored with fixed cameras and 
fixed acoustic sensors. If an entity emits sound, the system provides a robust 
multimodal estimate of the location of the object by utilizing the time delay 
of the audio signal between the microphones and the spatial trajectory of the 
visual pattern [6,7]. In [6], the scene is a conference room equipped with 32 
omnidirectional microphones and two stereo cameras, in which a multi-object 
3D tracking is performed. With the same environmental configuration, in [8] an 
audio source separation application is proposed: two people speak simultaneously 
while one of them moves through the room. Here the visual information strongly 
simplifies the audio source separation. 

The latter class of approaches employs only one microphone. In this case
the explicit notion of the spatial relationship among sound sources is no longer
recoverable, so the audio-visual localization process must depend purely on the
concept of synchrony, as stated in [9]. Early studies on audio-visual synchrony
come from cognitive science. Simultaneity is one of the most powerful cues
available for determining whether two events define a single or multiple objects;
moreover, psychophysical studies have shown that human attention focuses
preferably on sensory information perceived coupled in time, suppressing the
rest [4]. Particular effort has been spent on the study of the situation in which
the inputs arrive through two different sensory modalities (such as sight and
sound) [5].

Most of the techniques in this context make use of measures based on the
mutual information criterion [8,10]. These methods extract the pixels of the
video sequences that are most related to the occurring audio data by maximizing
the mutual information between the entire audio and visual signals, and
therefore result in off-line processing. For instance, they are used for
videoconference annotation [10]: audio and video features are modelled as
Gaussian processes, without a distinction between FG and BG. The association is
exploited by searching for a correlation in time of each pixel with each audio
feature. The main problem is that this approach assumes that the visual pattern
remains fixed in space; further, the analysis is carried out completely
off-line.

The method proposed in this paper tries to bridge these two research areas. 
To the best of our knowledge, the proposed system constitutes the first attempt 
to design an on-line integrated audio-visual BG modelling scheme using only one 
microphone, and working in a loosely constrained environment. 



3 The Audio-Video Background Modelling 



The key methodology is represented by the on-line time-adaptive mixture of 
Gaussians method. This technique has been used in the past to detect changes 
in the grey level of the pixels for background modelling purposes [2]. In our 
case, we would like to exploit this method to detect audio foreground, video
foreground objects, and joint audio-video FG events, by building a robust and
reliable multimodal background model. The basic concepts of this approach are
summarized in Section 3.1. The customization to visual and audio background
modelling is presented in Sections 3.2 and 3.3, respectively. Finally, the
integration between audio and video data is detailed in Section 3.4, and
Section 3.5 reports how the complete system is used to solve a typical problem
of visual surveillance systems.



3.1 The Time-Adaptive Mixture of Gaussians Method

The time-adaptive mixture of Gaussians method aims at discovering the
deviance of a signal from its expected behavior in an on-line fashion. A typical
video application is the well-known BG modelling scheme proposed in [2].

The general method models a temporal signal with a time-adaptive mixture
of Gaussians. The probability of observing the value $z^{(t)}$ at time $t$ is given by:

$$P(z^{(t)}) = \sum_{r=1}^{R} w_r^{(t)} \, \mathcal{N}\left(z^{(t)} \mid \mu_r^{(t)}, \sigma_r^{(t)}\right) \qquad (1)$$

where $w_r^{(t)}$, $\mu_r^{(t)}$, and $\sigma_r^{(t)}$ are the mixing coefficient, the mean, and the standard
deviation, respectively, of the $r$-th Gaussian of the mixture associated to the
signal at time $t$. At each time instant, the Gaussians in a mixture are ranked
in descending order using the $w/\sigma$ value. The $R$ Gaussians are evaluated as
possible matches against the new signal value, where a successful
match is defined as a value falling within $2.5\sigma$ of one of the components.
If no match occurs, a new Gaussian with mean equal to the current value, high
variance, and low mixing coefficient replaces the least probable component.

If $r_{hit}$ is the matched Gaussian component, the value is labelled as
unexpected (i.e., foreground) if $\sum_{r=1}^{r_{hit}} w_r^{(t)} > T$, where $T$ is a threshold representing
the minimum portion of the data that supports the “expected behavior”. The
evolution of the components of the mixtures is driven by the following equations:

$$w_r^{(t)} = (1 - \alpha)\, w_r^{(t-1)} + \alpha\, M_r^{(t)}, \quad 1 \le r \le R, \qquad (2)$$

where $M_r^{(t)}$ is 1 for the matched Gaussian (indexed by $r_{hit}$) and 0 for the others;
$\alpha$ is the adaptive rate, which remains fixed over time. It is worth noting that
the higher the adaptive rate, the faster the model is “adapted” to scene changes.

The $\mu$ and $\sigma$ parameters of unmatched Gaussians remain unchanged, but,
for the matched Gaussian component $r_{hit}$, we have:

$$\mu_{r_{hit}}^{(t)} = (1 - \rho)\, \mu_{r_{hit}}^{(t-1)} + \rho\, z^{(t)} \qquad (3)$$

$$\sigma_{r_{hit}}^{2\,(t)} = (1 - \rho)\, \sigma_{r_{hit}}^{2\,(t-1)} + \rho\, \left(z^{(t)} - \mu_{r_{hit}}^{(t)}\right)^{T} \left(z^{(t)} - \mu_{r_{hit}}^{(t)}\right) \qquad (4)$$

where $\rho = \alpha\, \mathcal{N}\left(z^{(t)} \mid \mu_{r_{hit}}^{(t)}, \sigma_{r_{hit}}^{(t)}\right)$.
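
The update rules of this section can be illustrated with a minimal Python sketch for a scalar signal. It is not the authors' implementation: the class name, the initial mean, the values of `init_sigma`, `init_weight` and `min_sigma`, and the choice of the last-ranked component as the "least probable" one are all assumptions made for illustration. The foreground test uses the cumulative weight of the components ranked before the matched one, as the scheme of [2] is commonly implemented; this too is an assumption of the sketch. The factor $\rho$ is evaluated with the parameters of the matched component before they are updated.

```python
import math


class AdaptiveMixture:
    """Time-adaptive mixture of R scalar Gaussians (illustrative sketch of Eqs. 1-4)."""

    def __init__(self, n_components=4, alpha=0.005, threshold=0.7,
                 init_sigma=30.0, init_weight=0.05, min_sigma=1.0):
        self.alpha = alpha              # adaptive rate
        self.threshold = threshold      # threshold T on the cumulative weight
        self.init_sigma = init_sigma    # "high variance" of a new component (assumed value)
        self.init_weight = init_weight  # "low mixing coefficient" of a new component (assumed)
        self.min_sigma = min_sigma      # floor that keeps sigma away from zero (assumed)
        self.w = [1.0 / n_components] * n_components
        self.mu = [0.0] * n_components
        self.sigma = [init_sigma] * n_components

    def _pdf(self, z, r):
        s = self.sigma[r]
        return math.exp(-0.5 * ((z - self.mu[r]) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

    def update(self, z):
        """Process one sample and return True if it is labelled as foreground."""
        # Rank the components in descending order of w / sigma.
        order = sorted(range(len(self.w)),
                       key=lambda r: self.w[r] / self.sigma[r], reverse=True)
        # A match is a value within 2.5 sigma of a component.
        hit_rank = next((k for k, r in enumerate(order)
                         if abs(z - self.mu[r]) <= 2.5 * self.sigma[r]), None)
        if hit_rank is None:
            # No match: replace the last-ranked component and renormalize the weights.
            r = order[-1]
            self.mu[r], self.sigma[r], self.w[r] = z, self.init_sigma, self.init_weight
            total = sum(self.w)
            self.w = [w / total for w in self.w]
            return True
        hit = order[hit_rank]
        # Foreground if the components ranked before the match already explain more than T.
        weight_before = sum(self.w[r] for r in order[:hit_rank])
        is_foreground = weight_before > self.threshold
        # Eq. 2: weight update for all components.
        self.w = [(1.0 - self.alpha) * w + self.alpha * (1.0 if r == hit else 0.0)
                  for r, w in enumerate(self.w)]
        # Eqs. 3-4: mean and variance update for the matched component only.
        rho = self.alpha * self._pdf(z, hit)
        self.mu[hit] = (1.0 - rho) * self.mu[hit] + rho * z
        var = (1.0 - rho) * self.sigma[hit] ** 2 + rho * (z - self.mu[hit]) ** 2
        self.sigma[hit] = max(math.sqrt(var), self.min_sigma)
        return is_foreground
```

With a constant input, the first samples are labelled as foreground and later ones as background, once the matched component has accumulated enough weight; a sudden jump of the signal is labelled as foreground again.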



3.2 Visual Foreground Detection 

One of the goals of this work is to detect untypical video activity patterns
starting simultaneously with audio ones. In order to discover these visual
patterns, a video processing method has been designed, which is composed of two
modules: a standard per-pixel FG detection module, and a histogram-based novelty
detection module.

The former is realized using the model introduced in the previous section in
a standard way [2], in which the processed signal $z^{(t)}$ is the time evolution of
the gray level. We use a set of independent adaptive mixtures of Gaussians, one
for each pixel. In this case, a pixel with an unexpected value $z_{uv}$ (where $u, v$ are the
coordinates of the image pixel) is the visual foreground, i.e., $z_{uv} \in FG$. Note
that all the mixtures’ parameters are updated with a fixed learning coefficient
$\alpha$.

The latter module is also realized using the time-adaptive mixture of
Gaussians method, with the same learning rate $\alpha$ as the former module, but in
this case we focus on the histogram of those pixels classified as foreground.
The idea is to compute at each step the gray level histogram of the FG pixels
and to associate an adaptive mixture of Gaussians to each bin, looking for
variations of the bin’s value. This means that we are monitoring the number of
foreground pixels that have a specific gray value. If the number of pixels
associated to the foreground grows, i.e., some histogram bins increase their
values, then an object is appearing in the scene; otherwise, it is
disappearing. We choose to monitor the histogram instead of the number of FG
pixels directly (which can in principle be sufficient to detect new objects),
as it allows the discrimination between different FG objects and the detection
of audio-visual patterns composed of single objects. We are aware that this
simple characterization leaves some ambiguities (e.g., two equally colored
objects are not distinguishable, even if the impact of this problem may be
weakened by increasing the number of bins), but this representation has the
appealing characteristic of being invariant to the spatial localization of the
foreground, which is not constrained to be statically linked to a spatial
location (as in other audio-video analysis approaches)¹.
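
The FG histogram described above can be sketched in a few lines of Python with NumPy. The function name and the argument names are hypothetical, and the sketch assumes an 8-bit gray-level frame together with a boolean mask produced by the per-pixel module; each returned bin count would then be fed to its own adaptive mixture.

```python
import numpy as np


def foreground_histogram(gray_frame, fg_mask, n_bins=12, n_levels=256):
    """Count the foreground pixels falling into each gray-level bin."""
    fg_values = gray_frame[fg_mask]          # keep only pixels labelled as FG
    hist, _ = np.histogram(fg_values, bins=n_bins, range=(0, n_levels))
    return hist


# Toy frame: three foreground pixels, one dark and two bright.
frame = np.array([[10, 10], [200, 250]], dtype=np.uint8)
mask = np.array([[True, False], [True, True]])
hist = foreground_histogram(frame, mask, n_bins=4)
```

Because only the gray values of the FG pixels are counted, moving the same object to another position in the frame leaves the histogram unchanged, which is the spatial invariance mentioned in the text.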



3.3 Audio Background Modelling 

The audio BG modelling module aims at extracting information from audio 
patterns acquired by a single microphone. In the literature, several approaches 
to audio analysis are present, mainly focused on the computational translation of 
psychoacoustics results. One class of approaches is the so-called “computational
auditory scene analysis” (CASA) [12], aimed at the separation and classification
of sounds present in a specific environment. Closely related to this field, but
less investigated, is “computational auditory scene recognition” (CASR)
[13,14], aimed at environment interpretation instead of analyzing the different
sound sources.

Besides various psycho-acoustically oriented approaches derived from these
two classes, a third approach tries to fuse “blind” statistical knowledge with
biologically driven representations of the two previous fields, performing audio
classification and segmentation tasks [15], and source separation [16,17] (blind
source separation). In this last approach, many efforts are devoted to the speech
processing area, in which the goal is to separate the different voices composing
the audio pattern using several microphones [17] or only one monaural sensor
[16].

The approach presented in this paper could be inserted in this last category:
roughly speaking, we implement a multiband spectral analysis on the audio signal
at video frame rate, extracting energy features from $a_1, a_2, \ldots, a_M$ frequency
subbands. In more detail, we subdivide the audio signal into overlapping temporal
windows of fixed length $W_a$, in which each temporal window ends at the instant
corresponding to the $t$-th video frame, as depicted in Fig. 1. For each window,
a parametric estimation of the power spectral density with the Yule-Walker
autoregressive method [18] is performed. In this way, an estimation $a_i^{(t)}$ of the
spectral energy relative to the interval $[t - W_a, t]$ is obtained for the $i$-th subband,
$i = 1, 2, \ldots, M$. These features have been chosen as they are able to discriminate
between different sound events [13]; further, they can be easily computed at a
high temporal rate.
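
The feature extraction for one audio window can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the autocorrelation is estimated with the biased estimator, the lowest band edge `omega_min` is an assumed value (logarithmic spacing cannot start at 0), and the band energy is approximated by a Riemann sum of the PSD over the band.

```python
import numpy as np


def yule_walker_psd(x, order, n_freqs=4096):
    """AR power spectral density on [0, pi] from the Yule-Walker equations."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz system R a = r[1:], with R[i, j] = r[|i - j|].
    lags = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    coeffs = np.linalg.solve(r[lags], r[1:])
    noise_var = r[0] - np.dot(coeffs, r[1:])
    omega = np.linspace(0.0, np.pi, n_freqs)
    k = np.arange(1, order + 1)
    # Model x[n] = sum_k a_k x[n-k] + e[n], so PSD = noise_var / |1 - sum_k a_k e^{-j w k}|^2.
    transfer = 1.0 - np.exp(-1j * np.outer(omega, k)) @ coeffs
    return omega, noise_var / np.abs(transfer) ** 2


def log_band_edges(n_bands, omega_min=0.01 * np.pi):
    """Logarithmically spaced band edges in radians (omega_min is an assumption)."""
    return np.geomspace(omega_min, np.pi, n_bands + 1)


def subband_energies_db(x, order=40, n_bands=16):
    """Energy of each subband, in decibels, for one audio window."""
    omega, psd = yule_walker_psd(x, order)
    edges = log_band_edges(n_bands)
    d_omega = omega[1] - omega[0]
    energies = np.array([psd[(omega >= lo) & (omega < hi)].sum() * d_omega
                         for lo, hi in zip(edges[:-1], edges[1:])])
    return 10.0 * np.log10(energies)
```

Calling `subband_energies_db` on the window ending at each video frame yields one feature per subband, and each feature can then be monitored by its own adaptive mixture.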

As typically considered [16], the energy over time in different frequency
bands can carry independent information. Therefore, we instantiate one
time-adaptive mixture of Gaussians for each band of the frequency spectrum. 
Also in this case, all mixtures’ parameters are updated with a fixed learning 
coefficient a, equal to the one used for the video channel. In this way, we are 
able to discover unexpected audio behaviors for each band, indicating an audio 
foreground. 

¹ Actually, more sophisticated tracking approaches based on histograms have already
been proposed in the literature [11], and are the subject of future work.

Fig. 1. Organization of the multimodal data set: at each video frame, an audio signal
analysis is carried out using a temporal window of length $W_a$



3.4 The Audio-Visual Fusion 



The audio and visual spaces are now partitioned into different independent
subspaces, the audio subbands $a_1, a_2, \ldots, a_M$ and the video FG histogram bins
$h_1, h_2, \ldots, h_N$, respectively, in which independent FG monomodal patterns
may occur. Therefore, given an audio subband $a_i^{(t)}$ and a video histogram
bin $h_j^{(t)}$ at time $t$, we can define a history of the mono-modal FG patterns
$A_i^{(t)}$, $i = 1, \ldots, M$, and $H_j^{(t)}$, $j = 1, \ldots, N$, as the patterns in which the values
of a given component of the $i$-th mixture for the audio, and of the $j$-th mixture
for the video, are detected as foreground over time. Formally, let us denote $A_i^{(t)}$
and $H_j^{(t)}$ as:

$$A_i^{(t)} = \left[ a_i^{(t_{q,i})}, a_i^{(t_{q,i}+1)}, \ldots, a_i^{(t)} \in FG \right] \qquad (5)$$

$$H_j^{(t)} = \left[ h_j^{(t_{u,j})}, h_j^{(t_{u,j}+1)}, \ldots, h_j^{(t)} \in FG \right] \qquad (6)$$

where $t_{q,i}$ is the first instant at which the $q$-th Gaussian component of the
audio mixture of the $i$-th sub-band becomes FG, and the same applies for
$t_{u,j}$ related to the video data. Clearly, $A_i^{(t)}$ and $H_j^{(t)}$ are possibly not completely
overlapped, so $t_{q,i}$ in general can be different from $t_{u,j}$. Therefore, in order to
evaluate the degree of concurrency, we define a concurrency value as
$\beta_{i,j} = |t_{q,i} - t_{u,j}|$. Obviously, the higher this value, the weaker the synchronization.

As previously stated, synchronism gives a natural causal relationship for
processes coming from different modalities [4]. In order to evaluate this causal
dependency over time, we state as highly correlated those concurrent
audio-video FG patterns exhibiting, in their joint evolution, a nearly stable
behavior. Consequently, we couple all the audio FG values with all the visual FG
values occurring at time step $t$, building an $M \times N$ audio-visual FG matrix
$AV^{(t)}$, where

$$AV^{(t)}(i,j) = \begin{cases} \left(a_i^{(t)}, h_j^{(t)}\right) & \text{if } a_i^{(t)} \in FG \wedge h_j^{(t)} \in FG \\ \text{empty} & \text{otherwise} \end{cases} \qquad (7)$$

This matrix gives a snapshot of the degree of synchrony between audio and visual
FG values, for all $i,j$. If $AV^{(t)}(i,j)$ is not empty, $A_i^{(t)}$ and $H_j^{(t)}$ are probably in
some way synchronized. In this last case, we choose to model the evolution of
these values using an on-line 2D adaptive Gaussian model. Therefore, at each
time step $t$, we can evaluate the probability of observing a pair of audio-visual FG
events as

$$P\left(AV^{(t)}(i,j)\right) = \sum_{r=1}^{R} w_{AV_r}^{(t)} \, \mathcal{N}\left(AV^{(t)}(i,j) \mid \mu_{AV_r}^{(t)}, \Sigma_{AV_r}^{(t)}\right) \qquad (8)$$



Intuitively, the higher the value of the weight matched by the observation
at time $t$, namely $w_{AV_{r_{hit}}}^{(t)}$, the more stable are the coupled audio-visual
FG values over time, and the more probable it is that a causal relation is present
between audio and visual FG.

All the necessary information to assess the synchrony and the stability of a
pair of audio and video FG patterns is now available. Therefore, a modulation of
the evolution process of the 2D Gaussian mixture model is introduced in order to
give more importance to a match with a pair of FG values belonging to likely
synchronized audio and video patterns. We would like to impose that the higher
the concurrency, the faster the stability of an AV value must be highlighted. In
formulas, omitting the indices $i,j$ for clarity,

$$w_{AV_r}^{(t)} = (1 - \alpha_{AV})\, w_{AV_r}^{(t-1)} + \alpha_{AV}\, M_{AV}^{(t)}, \quad 1 \le r \le R, \qquad (9)$$

where

$$M_{AV}^{(t)} = \begin{cases} \dfrac{1}{\beta_{i,j} + 1} = \dfrac{1}{|t_q - t_u| + 1} & \text{for the matched 2D Gaussian} \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$



This equation² implies that if the synchronization does not occur at the
same instant, the weight grows more slowly, and vice versa.

In order to subsume the concurrency and the stability behavior of the
multimodal FG patterns, we finally introduce the causality matrix $\Gamma^{(t)} = [\gamma_{i,j}^{(t)}]$, for
all $i = 1, \ldots, M$ and $j = 1, \ldots, N$, where

$$\gamma^{(t)}(i,j) = w_{AV_{r_{hit}}}^{(t)} \qquad (11)$$

where $w_{AV_{r_{hit}}}^{(t)}$ is the weight of the 2D Gaussian component of the model matched
by the pair of FG values $AV^{(t)}(i,j)$.

As we will see in the experimental section, this model describes well the
degree of stability of the audio-visual FG, in an on-line unsupervised fashion.
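
The effect of the concurrency value on the weights can be illustrated with a small Python sketch of the modulated update for a single cell $(i,j)$. The function names are hypothetical, the matching of the 2D Gaussian components is not modelled (the same component is simply assumed to be matched at every step), and the weights are assumed to start from zero.

```python
def concurrency(t_audio_onset, t_video_onset):
    """Concurrency value beta: distance in frames between the two FG onsets."""
    return abs(t_audio_onset - t_video_onset)


def update_av_weights(weights, hit, beta, alpha_av=0.05):
    """One step of the modulated weight update for one cell of the AV matrix."""
    m_hit = 1.0 / (beta + 1.0)   # modulation term for the matched component
    return [(1.0 - alpha_av) * w + alpha_av * (m_hit if r == hit else 0.0)
            for r, w in enumerate(weights)]


def causality_after(n_steps, beta, n_components=4, hit=0):
    """Causality entry after n_steps in which component `hit` is always matched."""
    weights = [0.0] * n_components
    for _ in range(n_steps):
        weights = update_av_weights(weights, hit, beta)
    return weights[hit]


gamma_sync = causality_after(100, concurrency(72, 72))    # simultaneous onsets
gamma_late = causality_after(100, concurrency(72, 102))   # onsets 30 frames apart
```

In this sketch the weight of a repeatedly matched component tends to $1/(\beta_{i,j}+1)$, so a pair with simultaneous onsets approaches 1 while a pair whose onsets are 30 frames apart stays close to $1/31$.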



² Any function that decreases with $\beta_{i,j}$ could be used; in practice, different
choices of function do not significantly affect the performance of the method.

3.5 Application to the Sleeping Foreground Problem

The sleeping foreground problem occurs when a moving object, initially detected
as foreground, stops, and becomes integrated in the background model after a
certain period. We want to face this situation, under the hypothesis that there is
a multimodal FG pattern, i.e., by detecting the correlation between audio and
video FG. In this situation, we maintain as foreground both the visual
appearance of the object and the detected audio pattern, as long as they are
present and stable in time. Technically speaking, we compute the learning rate
of the mixture of Gaussians associated to the video histogram bin $j$ as

$$\alpha_j^{(t)} = \min\left(\alpha,\; 1 - \max_i \gamma^{(t)}(i,j)\right) \qquad (12)$$

where $\alpha$ is the learning rate adopted for both the segregated sensorial channels.
The learning rates of the adaptive mixtures of all pixels whose gray level belongs
to the histogram bin $j$ become $\alpha_j^{(t)}$. Moreover, the learning rate of the
mixture associated to the band $\arg\max_i \gamma^{(t)}(i,j)$ also becomes $\alpha_j^{(t)}$. This measure
implies that the audio FG pattern most correlated with the $j$-th video FG
pattern guides the evolution step, and vice versa. In practice we can distinguish
$\min(M, N)$ different audio-video patterns. This may appear a weakness of the
method, but the problem may be easily solved by using a finer discretization of
the audio spectrum and of the histogram space. Moreover, other features could
be used for the video data modelling, for instance color characteristics.
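
The learning-rate modulation of this section can be sketched with NumPy as follows. The function name is hypothetical, the causality matrix is assumed to be stored as an $M \times N$ array with audio subbands along the rows, and the default rate 0.005 is the value reported for the separate channels in Section 4.

```python
import numpy as np


def modulated_learning_rates(gamma, alpha=0.005):
    """Per-bin learning rates and the audio band coupled to each histogram bin.

    gamma: M x N causality matrix (rows: audio subbands, columns: histogram bins).
    """
    strongest = gamma.max(axis=0)                 # max over subbands i, for each bin j
    alpha_j = np.minimum(alpha, 1.0 - strongest)  # rate shrinks as causality grows
    coupled_band = gamma.argmax(axis=0)           # band whose mixture gets the same rate
    return alpha_j, coupled_band


gamma = np.zeros((3, 2))
gamma[1, 0] = 0.999       # strong, stable coupling between band 1 and bin 0
alpha_j, coupled_band = modulated_learning_rates(gamma)
```

For the coupled bin the rate drops to about 0.001, so the corresponding pixels are absorbed into the background more slowly, while the uncoupled bin keeps the default rate of 0.005.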

4 Experimental Results 

An indoor audio-visual sequence is considered, in which two sleeping FG situa- 
tions occur: the former is associated with audio cues, and the latter is not. We 
will show that our system is able to deal with both situations. 

In more detail, the sequence is captured at 30 frames per second, and the
audio signal is sampled at 22,050 Hz. The temporal window used for multiband
frequency analysis is equal to 1 second, and the order of the autoregressive
model is 40. We undersample the $128 \times 120$ video image to a grid of $32 \times 30$
locations. Finally, we use $N = 12$ bins for the FG color histogram. Analogously, we
perform spectral analysis using $M = 16$ logarithmically spaced frequency subbands,
in which the frequency is measured in radians in the range $[0, \pi]$, and the power is
measured in decibels. As a consequence, we have an audio-visual space quantized
in $M \times N = 16 \times 12$ elements. All adaptive mixtures are composed of 4 Gaussian
components; the learning parameter is fixed to $\alpha_{AV} = 0.05$ for the AV mixtures,
and initially to $\alpha = 0.005$ for the separate channels.

We compare our results with those produced by a “video-only” BG modelling
scheme, choosing as reference the standard video BG modelling adopted in [2],
showing: 1) the resulting analysis of both BG modelling schemes; 2) the audio
BG modelling analysis; 3) the histogram FG modelling analysis, able to identify
the appearance of new visual FG in the scene; and 4) the causality matrix,
ordered by audio subbands per video histogram bins, which intuitively explains
the intensity of the causal relationship in the joint audio-visual space.

[Fig. 2 image panels: rows for frames 568, 668, 703, and 998, each with panels a) to f)]

Fig. 2. Comparative results: a) Original sequence; b) Ordinary per-pixel video BG
modelling; c) Our approach; d) Video novelty detection; e) Audio background modelling;
f) Causality matrix at time t

As one can observe in Fig. 2, at frame 50 both per-pixel BG modelling schemes
locate a FG entering the scene. This causes a strong increment in the gray
level of the FG histogram that correctly detects this object as new (Fig. 2-50 d))
(the lighter bins indicate FG). At frame 72, the person begins to speak, causing
an increment in some subbands of the audio spectrum, which is detected as FG
by the audio module (Fig. 2-72 e)). Due to the (loose) synchrony of the audio
and visual events, the causality matrix evidences a concurrency, as depicted in
Fig. 2-72 f). Here, the lightest colored value indicates $\max_i \gamma^{(t)}(i,j)$, i.e., the
maximum causality relation over all audio subbands $i$, given the video histogram
bin $j$. Therefore, proportionally to the temporal stability of the audio-video
FG values, the causality matrix increments some of its entries. Consequently,
the learning coefficients of the corresponding audio, histogram, and pixel FG
models become close to zero according to Eq. 12. In this way, the synchronized
audio and visual FG which remain jointly similar over time are considered as
multimodal FG. In the typical video BG modelling scheme, if the visual FG
remains still in the scene for many iterations (Fig. 2-568 a) and 668 a)), it
loses all its meaning of novelty, and so becomes assimilated into the background
(Fig. 2-568 b) and 668 b)). More correctly, in the multimodal case, the FG loses
its meaning of novelty only if it remains still without producing sound. In
Fig. 2-568 c) and 668 c), the visual aspect of the FG is maintained by the audio
FG signal, by exploiting the causality matrix.

The audio-visual fusion is also able to preserve the adaptiveness of the BG
modelling when appropriate. In Fig. 2-703 a) and 719 a), a box falls near the talking
person, providing new audio and video FG, but, after a while, the box becomes
still and silent. In this case, it is correct that it becomes BG after some time
(see Fig. 2-998 b)). In our approach, too, the box becomes BG, as the audio
pattern decreases quickly, so that no audio-visual coupling occurs, and after
some iterations the box vanishes, whereas the talking person remains detected
(Fig. 2-719 c) and 998 c)). A subtle drawback is visible in Fig. 2-998 c): some
parts of the box do not completely disappear, because their gray level is similar to
that of the talking person, modelled as FG. This problem could be addressed by
using a different approach to model visual data (instead of the histogram), or,
for instance, a finer quantization of the video histogram space.

5 Conclusions 

In this paper, a new concept of multimodal background modelling has been
introduced, aimed at integrating audio and video cues for a more robust and
complete scene analysis. The separate audio and video streams are modelled
using a set of adaptive Gaussian models, able to discover audio and video
foregrounds. The integration of audio and video data is obtained by paying
particular attention to the concept of synchrony, represented using another set
of adaptive Gaussian models. The system is able to discover concurrent audio
and video cues, which are bound together to define audio-visual patterns. The
integrated probabilistic system is able to work on-line using only one camera
and one microphone. Preliminary experimental results have shown that this
integration makes it possible to face some still-open problems of video
surveillance systems, like the sleeping FG problem.

Abstract. While there is a vast amount of literature considering PDE-based
inpainting and inpainting by texture synthesis, only a few publications
are concerned with the combination of both approaches. We present a
novel algorithm which combines both approaches and treats each distinct
region of the image separately. Thus we are naturally led to include a
segmentation pass as a new feature. This way the correct choice of texture
samples for the texture synthesis is ensured. We propose a novel
concept of “local texture synthesis” which gives satisfactory results even
for large domains in a complex environment.



1 Introduction 

The increase in computing power and disk space over the last few decades has
created new possibilities for image and movie postprocessing. Today, old
photographs which are threatened by bleaching can be preserved digitally. Old
celluloid movies, taking more and more damage every time they are exhibited,
can be digitized and preserved. Unfortunately, much material has already
suffered. Typical damage includes scratches or stains in photographs,
peeled-off coatings, or dust particles burned into celluloid. All these flaws
create regions where the original image information is lost. Manual restoration
of images or single movie frames is possible, but it is desirable to automate
this process. Several inpainting algorithms have been developed to achieve this
goal. In this paper we focus on single image inpainting algorithms (there exist
more specialized algorithms for movie inpainting). They may roughly be divided
into two categories:

1. PDE-based algorithms are usually designed to connect edges (discontinuities
in boundary data) or to extend level lines in some adequate manner into the
inpainting domain, see [1,2,3,4,5,6,7,8,9,10,11]. They are targeted at
extrapolating geometric image features, especially edges, i.e., they create
regions inside the inpainting domain. Most of them produce disturbing artifacts
if the inpainting domain is surrounded by textured regions, see figure 1.



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3022, pp. 214-224, 2004.

2. Texture synthesis algorithms use a sample of the available image data and
aim to fill the inpainting domain such that the color relationship statistics
between neighboring pixels match those of the sample, see [12,13,14,15,16,
17,18]. They aim at creating intra-region details. If the inpainting domain
is surrounded by differently textured regions, these algorithms can produce
disturbing artifacts, see figure 2.




Fig. 1. An example which is not well suited for PDE inpainting. The texture synthesis
algorithm achieves visually attractive results (right picture), whereas PDE-based
inpainting algorithms fail for large domains surrounded by strongly textured areas
(middle picture)




Fig. 2. Texture synthesis may run into problems if the sampling domain is chosen
inappropriately. The balustrade in the left picture should be removed by resynthesizing
image contents, taking the rest of the picture as sample texture. The result can be seen
on the right: the ladder initiated spurious sampling of trees and leaves into the brick
wall



Until now there have been only a few algorithms that try to treat geometric
image features and texture simultaneously:

- An algorithm based on texture template spectrum matching has been proposed
in [19]. This algorithm does not fit into either of the two categories
mentioned above.

- A special purpose algorithm was used in [20] for restoring missing blocks
in wirelessly transmitted compressed images. Common lossy compression
algorithms (e.g., JPEG) divide an image into $8 \times 8$ pixel blocks that are
independently compressed. If a corrupted block is detected, it is reconstructed
rather than retransmitted, to decrease latency and bandwidth usage.
Reconstruction occurs by classifying the contents of adjacent blocks into
either structure or texture. Depending on this classification, the missing block
is restored by invoking either a PDE inpainting or a texture synthesis
algorithm.

- Closely related to our inpainting technique are the algorithms proposed in
[21,22], which are the most natural to compare with in the course of this paper.
They differ from our algorithm in the choice of the subalgorithms in each
step. Moreover, we propose to perform a segmentation step to determine
appropriate texture sample regions. This prevents artifacts arising in the
texture synthesis step, as exemplified in figure 2. This problem is not addressed
in [21,22].



2 The Algorithm 

The proposed inpainting algorithm consists of five steps: 

1. Filtering the image data and decomposing it into geometry and texture 

2. PDE inpainting of the geometry part 

3. Postprocessing of the geometry inpainting 

4. Segmentation of the inpainted geometry image 

5. Synthesizing texture for each segment 

We will describe each step in detail in the following subsections. The image,
denoted by a function $u : D \to \mathbb{R}$ (or $\mathbb{R}^3$ for color images),
is defined on the image domain $D \subset \mathbb{R}^2$. A user specified mask
function $m : D \to [0, 1]$ marks the inpainting domain
$\Omega = \operatorname{supp}(m)$. A value of 1 in the mask function highlights
the flawed region. The mask is designed to drop continuously to zero in a small
neighborhood (i.e., a few pixels) outside of the flawed region. This “drop down”
zone will later be used to smoothly blend the inpainting into the original image.



2.1 Filtering and Decomposition 

A nonlinear diffusion filter of Perona-Malik type [23] is applied to the image u,
i.e. it is evolved according to the partial differential equation

$$\frac{\partial u}{\partial t} = \nabla \cdot \bigl( d(\|\nabla u(x,y)\|) \, \nabla u \bigr) \tag{1}$$

where the diffusivity $d(s)$ is chosen to be

$$d(s) = \frac{1}{1 + \frac{s^2}{\lambda^2}} \tag{2}$$

with a suitably chosen parameter $\lambda$. The filtered image g is the solution
of (1) at a specified time $t_0$, characterizing the strength of filtering. The
effect of the filter is that g reveals piecewise constant intensities and contains
little texture and noise. Thus we call g the geometry part. We set u = g + t and
refer to t as the texture part. Throughout this paper we consider noise as a
texture pattern.
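
The decomposition u = g + t can be illustrated on a one-dimensional signal. The
following is only a minimal sketch under stated assumptions: an explicit Euler
scheme with unit grid spacing and no-flux boundaries, with illustrative values
for `lam`, `dt` and `steps` that are not taken from the paper. The function
names are hypothetical.

```python
def diffusivity(s, lam):
    """Perona-Malik diffusivity d(s) = 1 / (1 + s^2 / lam^2)."""
    return 1.0 / (1.0 + (s * s) / (lam * lam))


def perona_malik_1d(u, lam, dt=0.2, steps=50):
    """Explicit Euler steps of u_t = (d(|u_x|) u_x)_x with no-flux boundaries."""
    g = list(u)
    for _ in range(steps):
        # Flux between neighbouring samples i and i+1 (grid spacing 1).
        flux = []
        for i in range(len(g) - 1):
            diff = g[i + 1] - g[i]
            flux.append(diffusivity(abs(diff), lam) * diff)
        new = []
        for i in range(len(g)):
            inflow = flux[i] if i < len(g) - 1 else 0.0
            outflow = flux[i - 1] if i > 0 else 0.0
            new.append(g[i] + dt * (inflow - outflow))
        g = new
    return g


def decompose(u, lam, dt=0.2, steps=50):
    """Return (geometry, texture) with u = geometry + texture."""
    g = perona_malik_1d(u, lam, dt, steps)
    t = [ui - gi for ui, gi in zip(u, g)]
    return g, t


# A noisy step edge: small fluctuations on both sides of a large jump.
signal = [0.0, 0.1, -0.1, 0.05, 0.0, 10.0, 10.1, 9.9, 10.05, 10.0]
geometry, texture = decompose(signal, lam=1.0)
```

With `dt = 0.2` every update is a convex combination of neighbouring samples,
so the scheme stays within the range of the input, and the low diffusivity at
the large jump keeps the edge in the geometry part.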

Recently several advanced techniques for image decomposition have been
proposed, see [24,25,26]. In [21] decomposition is done by using the model from
[24], i.e. by jointly minimizing the BV seminorm of the function g and Meyer's
G-norm (see [27]) of the function t. Using this model allows one to extract texture
without noise, which leads to an approximate decomposition $u \approx g + t$, with
geometry part g and texture t. For inpainting this seems not to be optimal, since
a noise-free inpainting in a noisy image may look inadequate.

In [22] a lowpass/highpass filter is applied to the image to obtain g and t,
respectively. The decomposition is thus not into geometry and texture but rather
into low and high frequencies.



2.2 PDE Inpainting 

For the inpainting of the geometry part we use the Ginzburg-Landau algorithm 
proposed in [11]. Here we give a short overview: 

We calculate a complex valued function $\tilde g$, whose real part is the geometry
image g scaled to a range of values between -1 and 1. Further we demand that
$\|\tilde g(x,y)\|_{\max} = 1$ at every pixel $(x,y)$ of the initial data, and that
the imaginary part be nonnegative, $\Im \tilde g \ge 0$. The function $\tilde g$ is
evolved using the complex Ginzburg-Landau equation

$$\frac{\partial \tilde g}{\partial t} = \Delta \tilde g + \frac{1}{\varepsilon^2} \bigl( 1 - \|\tilde g(x,y)\|_{\max}^2 \bigr) \tilde g \tag{3}$$

inside $\Omega$, where the available data $\tilde g|_{\partial\Omega}$ is specified
as Dirichlet boundary condition. Here $\varepsilon \in \mathbb{R}$ is a length
parameter specifying the width of edges in the inpainting. $\|\cdot\|_{\max}$
denotes the maximum norm of the components of $\tilde g(x,y)$, which is just the
absolute value $|\tilde g(x,y)|$ if g is a grayscale image, and
$\max\{|\tilde g_{red}|, |\tilde g_{green}|, |\tilde g_{blue}|\}$ for RGB images.
The real part $\Re \tilde g$ of the evolved image at some time $T_0$ (i.e., when
$\tilde g$ is “close enough” to steady state) is rescaled to the intensity range
of g and constitutes the inpainting.

For a more detailed description we refer to our presentation in [11]. Theoreti- 
cal results about the Ginzburg-Landau equation and similar reaction-diffusion 
equations can be found in [28,29]. 
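
The evolution (3) can be illustrated by a one-dimensional grayscale toy
version, where the maximum norm reduces to the complex modulus. This is only a
sketch under assumptions: it uses an explicit Euler scheme for
u_t = u_xx + (1 - |u|^2) u / eps^2 with unit grid spacing, the unknown samples
are initialized with zero, and the values of `eps`, `dt` and `steps` are
illustrative. It is not the numerical scheme of [11], and the function name is
hypothetical.

```python
import math


def ginzburg_landau_1d(left, right, gap, eps=2.0, dt=0.2, steps=2000):
    """Evolve a complex signal between two known gray values in [-1, 1].

    left, right: known gray values bordering the gap (Dirichlet data).
    gap: number of unknown samples between them.
    Returns the evolved complex samples, including both border samples.
    """
    def lift(g):
        # Complex value with real part g, unit modulus, nonnegative imaginary part.
        return complex(g, math.sqrt(max(0.0, 1.0 - g * g)))

    u = [lift(left)] + [0j] * gap + [lift(right)]
    for _ in range(steps):
        new = list(u)  # the two border samples stay fixed
        for i in range(1, len(u) - 1):
            laplace = u[i - 1] - 2.0 * u[i] + u[i + 1]
            reaction = (1.0 - abs(u[i]) ** 2) * u[i] / (eps * eps)
            new[i] = u[i] + dt * (laplace + reaction)
        u = new
    return u


evolved = ginzburg_landau_1d(-0.6, 0.8, gap=9)
inpainted = [z.real for z in evolved]  # real part, still scaled to [-1, 1]
```

For these step sizes the modulus of every sample stays bounded by 1 and the
imaginary part stays nonnegative, so the real part remains a valid gray value
in the scaled range.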

In [21] the PDE inpainting method from [4] is used. In comparison with 
(3) this algorithm (judging from the examples given in [4]) creates smoother 
and better aligned edges, but the Ginzburg-Landau algorithm reveals higher 
contrast and less color smearing. 

In [22] the inpainting technique from [10] is utilized. This algorithm is not 
designed to create edges in the inpainting domain. For the particular application 
to inpaint a low pass filtered image this is no problem. 



2.3 Postprocessing

The Ginzburg-Landau algorithm sometimes produces kinks and corners in edges.
A detailed discussion of this phenomenon has been given in [11]. This is
the price for high contrast edges in the inpainting domain. To straighten the
kinks we apply a coherence enhancing anisotropic diffusion filter to the image
g on the inpainting domain $\Omega$. This filter is described in [30]. We
implemented this diffusion filter using the multi-grid algorithm outlined in [31].
Since diffusion happens mostly along edges and not across them, the contrast does
not suffer significantly.



2.4 Segmentation 

As a preparation for the texture synthesis step the inpainted and postprocessed
geometry image (which for the sake of simplicity is again denoted by g) is
segmented. We employ a gradient controlled region growing algorithm, inspired
by the scalar Ginzburg-Landau equation:

Let $(S_i \subset D)_{i=0..N}$ be the (a-priori unknown) segmentation of D, i.e.
$\bigcap_i S_i = \emptyset$ and $\bigcup_i S_i \supseteq \Omega$. We assume that
every pixel in $\Omega$ belongs to exactly one segment $S_i$. We do not need to
segment the whole image domain D since segments which have no intersection with
$\Omega$ do not affect the final result. We derive the sets $S_i$ from a set of
auxiliary functions $(\tilde S_i : D \to \mathbb{R})_{i=0..N}$:

1. Set i = 0.

2. Choose an arbitrary pixel $(j,k) \in \Omega \setminus \bigcup_{0 \le n < i} S_n$.
   Set $\tilde S_i = -1$, except for $\tilde S_i(j,k) = +1$.

3. Evolve $\tilde S_i$ according to the equation

$$\frac{\partial \tilde S_i}{\partial t} = \Delta \tilde S_i - P'(\tilde S_i) - \alpha \|\nabla g(x,y)\| \tag{4}$$

   where $P'(x)$ is the derivative of the polynomial potential

$$P(x) = \frac{9}{4} x^4 + \frac{19}{8} x^3 - \frac{9}{2} x^2 - \frac{57}{8} x \tag{5}$$

   until a steady state is reached.

4. Set $S_i = \operatorname{supp}\bigl(\max(0, \tilde S_i^{\infty})\bigr)$, where
   $\tilde S_i^{\infty}$ is the steady state solution from the previous step.

5. If $\bigcup_{0 \le n \le i} S_n \supseteq \Omega$ terminate the algorithm, else
   set $i \leftarrow i + 1$ and continue with step 2.



Explanation of equation (4): like in the scalar Ginzburg-Landau equation, $P'(x)$
is the derivative of a bistable polynomial potential $P(x)$, forcing
$\tilde S_i(x)$ to take on values close to +1 or -1. Here $P(x)$ is chosen to be
nonsymmetric, having a shallow minimum at x = -1 and a deep minimum at x = +1.
Assume $\alpha = 0$. $P(x)$ is constructed such that the diffusion caused by the
Laplacian is strong enough to move $\tilde S_i$ in the surrounding of the seeding
pixel from the negative to the positive minimum. Thus under continued evolution
the +1 domain will spread out all over D. If $\alpha > 0$ the term depending on
$\nabla g$ is a forcing term acting against diffusion and thus eventually stops
propagation at edges of g. If needed, further terms could be added, e.g.
penalizing curvature of $\tilde S_i$ to prevent the segments from crossing small
narrow gaps. This turned out not to be required by our application.

Large changes in $\tilde S_i$ occur only in a small region between the +1 and the
-1 domain, so this algorithm can be efficiently implemented using a front tracking
method. Therefore only a small fraction of all pixels has to be processed at each
iteration.
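
The bistable, nonsymmetric shape of the potential can be checked numerically.
The sketch below assumes the coefficients
P(x) = 9/4 x^4 + 19/8 x^3 - 9/2 x^2 - 57/8 x, for which the derivative factors
as P'(x) = (x^2 - 1)(9x + 57/8); the function names are hypothetical.

```python
def potential(x):
    """P(x) = 9/4 x^4 + 19/8 x^3 - 9/2 x^2 - 57/8 x."""
    return 9 / 4 * x**4 + 19 / 8 * x**3 - 9 / 2 * x**2 - 57 / 8 * x


def potential_prime(x):
    """P'(x) = 9 x^3 + 57/8 x^2 - 9 x - 57/8 = (x^2 - 1) (9 x + 57/8)."""
    return 9 * x**3 + 57 / 8 * x**2 - 9 * x - 57 / 8


def potential_second(x):
    """P''(x) = 27 x^2 + 57/4 x - 9; positive at minima, negative at maxima."""
    return 27 * x**2 + 57 / 4 * x - 9


# Third stationary point: the root of 9 x + 57/8, a local maximum between the minima.
barrier = -19 / 24

for x in (-1.0, barrier, 1.0):
    print(x, potential(x), potential_prime(x), potential_second(x))
```

The barrier between x = -1 and the local maximum is very low compared with the
depth of the minimum at x = +1, which is what lets diffusion from the seed
pixel tip neighbouring values from the -1 state into the +1 state.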



2.5 Texture Synthesis 

For the texture synthesis step we employed the algorithm from [13]. In [22] the
same algorithm with a different implementation has been used. In [21] the
texture synthesis algorithm from [18] was used, which should give synthesis
results similar to those of [13]. In both [21,22] all of the available image data
$t|_{D \setminus \Omega}$ is taken as sample to synthesize texture in all of
$t|_{\Omega}$. Since the texture synthesis proceeds from the border inwards into
the inpainting domain it is evident that texture can be continued without
artifacts. A few “perturbing pixels” might suffice, though, to make the algorithm
use texture sample data from unsuitable image locations, see figure 2.

To circumvent the shortcomings of this “global sampling texture synthesis”
we introduce “texture synthesis by local sampling”: for every segment $S_i$ we take
$\Omega_i^{sample} = S_i \setminus \Omega$ and $\Omega_i^{synth} = S_i \cap \Omega$
to be the texture sample region and the texture synthesizing region, respectively.
Then the texture synthesis algorithm from [13] is applied to each pair
$(\Omega_i^{sample}, \Omega_i^{synth})$ individually. Here we tacitly assume
that differently textured regions belong to different segments. Two neighboring
regions with different textures may nevertheless belong to the same segment if
they have similar intensities, due to the initial diffusion filtering. However,
our experiments have shown that the impact of a few wrong texture samples is not
significant.
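
The construction of the local sample and synthesis regions amounts to two set
operations per segment. A minimal sketch, assuming that segments and the
inpainting domain are given as sets of (row, column) pixels; the function name
and the tiny example data are hypothetical.

```python
def local_sampling_regions(segments, inpainting_domain):
    """Split every segment into its (sample, synth) pixel sets.

    segments: dict mapping a segment id to a set of (row, col) pixels.
    inpainting_domain: set of (row, col) pixels forming the inpainting domain.
    """
    regions = {}
    for seg_id, pixels in segments.items():
        sample = pixels - inpainting_domain  # segment pixels outside the domain
        synth = pixels & inpainting_domain   # segment pixels inside the domain
        regions[seg_id] = (sample, synth)
    return regions


omega = {(0, 1), (1, 1)}
segments = {
    0: {(0, 0), (0, 1)},          # e.g. a wall region
    1: {(1, 0), (1, 1), (1, 2)},  # e.g. a foliage region
}
regions = local_sampling_regions(segments, omega)
```

Each pixel of the inpainting domain is then synthesized only from samples of
its own segment, which is the point of local sampling.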

The texture synthesis is the most time consuming part of our inpainting
algorithm: for every $(j,k) \in \Omega_i^{synth}$ set $t_{j,k} = t_{m,n}$, where
$(m,n) \in \Omega_i^{sample}$ is chosen such that t in a neighborhood of (m, n)
most closely resembles t in a neighborhood of (j, k). Finding the most similar
neighborhood for every pixel in $\Omega_i^{synth}$ leads to a considerable number
of nearest neighbor searches in a high dimensional vector space. In most of our
examples the number of test vectors (i.e. the number of pixels in
$\Omega_i^{sample}$) was too small, or the dimension of the vector space too
high, for a binary search tree to be effective. The runtime of the texture
synthesis using a search tree could not be improved compared to an exhaustive
search. See [32] for an efficiency discussion of nearest neighbor search
algorithms and the presentation of the algorithm that we used.
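
The exhaustive search mentioned above can be sketched as follows. This is an
illustration under assumptions, not the algorithm of [13] or [32]: the texture
is stored as a dictionary from pixels to values, neighborhoods are square
windows of a given radius, and similarity is the mean squared difference over
the offsets known in both windows. All names are hypothetical.

```python
def patch(t, row, col, radius, known):
    """Collect {offset: value} for the known pixels in the window around (row, col)."""
    values = {}
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            p = (row + dr, col + dc)
            if p in known:
                values[(dr, dc)] = t[p]
    return values


def best_match(t, target, sample, known, radius=1):
    """Exhaustively search the sample pixel with the most similar neighborhood."""
    target_patch = patch(t, target[0], target[1], radius, known)
    best, best_cost = None, float("inf")
    for cand in sorted(sample):
        cand_patch = patch(t, cand[0], cand[1], radius, known)
        common = target_patch.keys() & cand_patch.keys()
        if not common:
            continue  # no shared known offsets, nothing to compare
        cost = sum((target_patch[o] - cand_patch[o]) ** 2 for o in common) / len(common)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best


# One image row with an alternating texture 0, 1, 0, 1, ...; pixel (0, 4) is unknown.
texture = {(0, c): float(c % 2) for c in range(9)}
target = (0, 4)
known = set(texture) - {target}
sample = {(0, 1), (0, 2), (0, 6), (0, 7)}
match = best_match(texture, target, sample, known)
texture[target] = texture[match]  # copy the value of the best matching sample
```

The cost of one query is proportional to the number of sample pixels times the
window size, which is why this step dominates the runtime.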



2.6 Assembling

In a last step the synthesized textures are added to the geometry inpainting.
The final inpainting is blended onto the initial (flawed) image, i.e.
$u_{final} = m \cdot (g + t) + (1 - m) \cdot u_{initial}$, where m is the mask
function. This is to soften the impact of discontinuities which could arise from
texture synthesis.
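
The blending formula is a per-pixel convex combination. A minimal sketch,
assuming images stored as nested lists of equal shape; the function name and
the sample values are hypothetical.

```python
def blend(mask, geometry, texture, initial):
    """Per-pixel u_final = m * (g + t) + (1 - m) * u_initial."""
    return [
        [m * (g + t) + (1.0 - m) * u for m, g, t, u in zip(mr, gr, tr, ur)]
        for mr, gr, tr, ur in zip(mask, geometry, texture, initial)
    ]


mask = [[1.0, 0.5, 0.0]]         # flawed pixel, drop-down zone, untouched pixel
geometry = [[10.0, 10.0, 10.0]]
texture = [[1.0, -1.0, 0.0]]
initial = [[0.0, 20.0, 30.0]]
final = blend(mask, geometry, texture, initial)
```

Where the mask is 1 the inpainting replaces the image, where it is 0 the
original pixel is kept, and in the drop-down zone both are mixed.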



3 Results and Discussion 

The results presented in this section are chosen to highlight various situations 
an inpainting algorithm has to tackle: 

1. In figure 3 the inpainting domain consists of thin and long structures, as 
occurring in scratch and text removal. This is easier to inpaint than equally 
sized compact shaped areas, since edges have to be established over short 
distances only. Additionally, if the size of typical texture features is com- 
parable to the “width” of the inpainting domain, then even inappropriately 
synthesized texture does not necessarily produce an artifact. 

2. In figure 4 inpainting is easy because $\Omega$ is surrounded by a single
homogeneous region. The PDE inpainting only needs to adopt the appropriate color.
An appropriate texture sampling region is easily found.

3. Inpainting is difficult if the object to be removed covers a variety of different 
regions containing complex textures, which happens mainly in airbrushing 
applications, see figure 5. 

One difficult example is shown in figure 5: the balustrade to be removed covers 
three adjacent regions with two different textures. The brick wall is considered 
as two distinct regions, because of the noticeable difference in brightness. Com- 
pared to the plain texture synthesis result from figure 2 no improper texture is 
synthesized into the wall, due to the local sampling. Unfortunately the corners 
of the building are found to be another segment and the brick pattern on the 
corners is not synthesized satisfactorily (neither is it in figure 2). More examples 
can be found in [33]. 



3.1 Choice of Parameters 

Our proposed algorithm contains several numeric parameters that may be tuned:
the edge sensitivity $\lambda$ in the pre-filtering, the edge width $\varepsilon$
in the PDE inpainting, an edge sensitivity and a regularization parameter for the
post-processing (which have not been explicitly mentioned) and the strength
$\alpha$ of the forcing term in the segmentation. Further, for the diffusion
equations in the pre- and post-filtering, stopping times (or stopping criteria)
have to be specified (not for the inpainting and the segmentation, which are
evolved to steady state). Moreover, the size and the shapes of the neighbourhood
regions in the texture synthesis phase could also be adjusted.

The examples in this paper have been created with a fixed parameter setting
that has been tuned on an appropriate training set. We found that the quality of
the inpainting was increased only marginally if the parameters were fine-tuned
for each image separately. Automatic content based parameter determination
resulted in better quality only in a few cases but produced serious artifacts
more often. Besides, not all parameters may be chosen independently, i.e., if
$\varepsilon$ is increased then so should be $\alpha$.



Fig. 3. This example is easy since the inpainting domain consists of long thin structures
only



Fig. 4. This example is easy because the object to be removed is surrounded by a single
weakly textured region



3.2 Future Work 

As already pointed out in [21] each separate step of the inpainting algorithm 
could be performed with several different subalgorithms. Since it is unlikely that 
one combination performs optimally, it would be desirable to have a criterion for 
automatically choosing the appropriate algorithms. This criterion would have to 
include the form of the inpainting domain, image contents, amount of texture 
and probably more. 








H. Grossauer 




Fig. 5. The airbrushed image during each step of the algorithm. First row: the ori- 
ginal image and the mask, with the inpainting domain being white. Second row: 
the filtered image and the difference image (i.e. the texture part). Third row: the 
inpainted geometry part before (left) and after (right) postprocessing. Fourth row: 
result of the segmentation and final result of the complete inpainting algorithm. Note 
that compared to figure 2 textures are synthesized using only appropriate information 
from the surrounding region. Unfortunately the salient brick pattern on the edge of 
the building was not correctly recognized by the segmentation 
Abstract. Recently, a 3D face recognition approach based on geometric
invariant signatures has been proposed. The key idea is a representation
of the facial surface, invariant to isometric deformations, such as those
resulting from facial expressions. One important stage in the construction
of the geometric invariants involves measuring geodesic distances
on triangulated surfaces, which is carried out by the fast marching on
triangulated domains algorithm.

Proposed here is a method that uses only the metric tensor of the surface 
for geodesic distance computation. That is, the explicit integration of the 
surface in 3D from its gradients is not needed for the recognition task. 
It enables the use of simple and cost-efficient 3D acquisition techniques 
such as photometric stereo. Avoiding the explicit surface reconstruction 
stage saves computational time and reduces numerical errors. 



1 Introduction 

One of the challenges in face recognition is finding an invariant representation 
for a face. That is, we would like to identify different instances of the same 
face as belonging to a single subject. Particularly important is the invariance to 
illumination conditions, makeup, head pose, and facial expressions - which are 
the major obstacles in modern face recognition systems. 

A relatively new trend in face recognition is an attempt to use 3D imaging. 
Besides a conventional face picture, three dimensional images carry all the infor- 
mation about the geometry of the face. The usage of this information, or part of 
it, can potentially make face recognition systems less sensitive to illumination, 
head orientation and facial expressions. 

In 1996, Gordon showed that combining frontal and profile views can improve 
recognition accuracy [1]. This idea was extended by Beumier and Acheroy, who 
compared central and lateral profiles from the 3D facial surface, acquired by a 
structured light range camera [2]. This approach demonstrated some robustness 
to head orientations. Another attempt to cope with the problem of head pose 
was presented by Huang et al. using 3D morphable head models [3]. Mavridis
et al. incorporated a range map of the face into the classical face recognition
algorithms based on PCA and hidden Markov models [4]. Their approach showed
robustness to large variations in color, illumination and use of cosmetics, and it
also allowed separating the face from a cluttered background.

T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3022, pp. 225-237, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

A.M. Bronstein et al.

Recently, Bronstein, Bronstein, and Kimmel [5] introduced a new approach
which is also able to cope with problems resulting from the non-rigid nature of
the human face. They applied the bending invariant canonical forms proposed in
[6] to the 3D face recognition problem. Their approach is based on the assumption
that most human facial expressions are near-isometric transformations
of the facial surface. The facial surface is converted into a representation which
is invariant under such transformations, and thus yields practically identical
signatures for different postures of the same face.

One of the key stages in the construction of the bending invariant representation
is the computation of the geodesic distances between points on the
facial surface. In [5], geodesic distances were computed using the Fast Marching
on Triangulated Domains (FMTD) algorithm [7]. A drawback of this method
is that it requires a polyhedral representation of the facial surface. In
particular, in [5] a coded-light range camera producing a dense range image was
used [9]. Commercial versions of such 3D scanners are still expensive.

In this paper, we propose 3D face recognition based on simple and cheap 
3D imaging methods, that recover the local properties of the surface without 
explicitly reconstructing its shape in 3D. One example is the photometric stereo 
method, that first recovers the surface gradients. The main novelty of this paper 
is a variation of the FMTD algorithm, capable of computing geodesic distances 
given only the metric tensor of the surface. This enables us to avoid the classical 
step in shape from photometric stereo of integrating the surface gradients into 
a surface. 

In Section 2 we briefly review 3D imaging methods that recover the metric 
tensor of the surface before reconstructing the surface itself; Section 3 is dedica- 
ted to the construction of bending-invariant canonical forms [6] , and in Section 4 
we present our modified FMTD algorithm. Section 5 shows how 3D face reco- 
gnition works on photometric stereo data. Section 6 concludes the paper. 



2 Surface Acquisition 

The face recognition algorithm discussed in this paper treats faces as three- 
dimensional surfaces. It is therefore necessary to obtain first the facial surface 
of the subject that we are trying to recognize. 

Here, our main focus is on 3D surface reconstruction methods that recover
local properties of the facial surface, particularly the surface gradient (see
footnote 1). As we will show in the following sections, the actual surface
reconstruction is not really needed for the recognition.

Footnote 1: The relationship between the surface gradient and the metric tensor
of the surface is established in Section 4 in equations (16) and (18).



2.1 Photometric Stereo

The photometric stereo technique consists of obtaining several pictures of the
same subject in different illumination conditions and extracting the 3D geometry
by assuming the Lambertian reflection model. We assume that the facial surface,
represented as a function, is viewed from a given position along the z-axis. The
object is illuminated by a source of parallel rays directed along $l^i$ (Figure 1).
The observed intensity is

$$I^i(x,y) = \max\bigl( \rho(x,y) \, n(x,y) \cdot l^i, \; 0 \bigr), \tag{1}$$

where $\rho(x,y)$ is the object albedo, and $n(x,y)$ is the normal to the object
surface, given as

$$n(x,y) = \frac{(-z_x(x,y), \; -z_y(x,y), \; 1)}{\sqrt{1 + \|\nabla z(x,y)\|_2^2}}. \tag{2}$$


Fig. 1. 3D surface acquisition using photometric stereo



Using matrix-vector notation, Eq. (1) can be rewritten as

$$I(x,y) = \max(Lv, 0), \tag{3}$$

where

$$L = \begin{pmatrix} l^1_1 & l^1_2 & l^1_3 \\ \vdots & \vdots & \vdots \\ l^N_1 & l^N_2 & l^N_3 \end{pmatrix}, \qquad I(x,y) = \begin{pmatrix} I^1(x,y) \\ \vdots \\ I^N(x,y) \end{pmatrix}, \tag{4}$$

and

$$v_1 = -z_x v_3; \quad v_2 = -z_y v_3; \quad v_3 = \frac{\rho(x,y)}{\sqrt{1 + \|\nabla z\|_2^2}}. \tag{5}$$

Given at least 3 linearly independent illuminations $\{l^i\}_{i=1}^N$ and the
corresponding observations $\{I^i\}_{i=1}^N$, one can reconstruct the values of
$\nabla z$ by pointwise least-squares solution

$$v = L^\dagger I(x,y), \tag{6}$$

where $L^\dagger = (L^T L)^{-1} L^T$ denotes the Moore-Penrose pseudoinverse of L.
When needed, the surface can be reconstructed by solving the Poisson equation

$$\tilde z_{xx} + \tilde z_{yy} = z_{xx} + z_{yy} \tag{7}$$

with respect to $\tilde z$, which is the minimizer of the integral measure

$$\iint \bigl( (\tilde z_x - z_x)^2 + (\tilde z_y - z_y)^2 \bigr) \, dx \, dy.$$



Photometric stereo is a simple 3D imaging method, which does not require 
expensive dedicated hardware. The assumption of Lambertian reflection holds 
for most parts of the human face (except the hair and the eyes) and makes this 
method very attractive for 3D face recognition application. 
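
The pointwise least-squares recovery of Eq. (6) can be sketched for a single
pixel. This is an illustration under assumptions: the light directions and the
surface values are made-up test data, no pixel is in shadow (so the max in
Eq. (3) is inactive), and the normal equations are solved with a small
Gaussian elimination. All function names are hypothetical.

```python
import math


def solve3(a, b):
    """Solve the 3x3 system a x = b by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        s = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - s) / m[r][r]
    return x


def render(z_x, z_y, albedo, lights):
    """Lambertian intensities of one pixel under each light direction."""
    norm = math.sqrt(1.0 + z_x**2 + z_y**2)
    n = (-z_x / norm, -z_y / norm, 1.0 / norm)
    return [max(albedo * sum(ni * li for ni, li in zip(n, l)), 0.0) for l in lights]


def recover_gradient(lights, intensities):
    """Solve (L^T L) v = L^T I for one pixel, then extract z_x, z_y and the albedo."""
    ltl = [[sum(l[i] * l[j] for l in lights) for j in range(3)] for i in range(3)]
    lti = [sum(l[i] * obs for l, obs in zip(lights, intensities)) for i in range(3)]
    v = solve3(ltl, lti)
    z_x, z_y = -v[0] / v[2], -v[1] / v[2]
    albedo = math.sqrt(v[0]**2 + v[1]**2 + v[2]**2)  # v = albedo * unit normal
    return z_x, z_y, albedo


lights = [(0.0, 0.0, 1.0), (0.5, 0.0, 0.866), (0.0, 0.5, 0.866), (-0.5, 0.0, 0.866)]
observed = render(0.3, -0.2, 0.8, lights)
z_x, z_y, albedo = recover_gradient(lights, observed)
```

Only the gradient values are needed for the later sections, so the Poisson
integration step is not part of this sketch.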



2.2 Structured Light 

Proesmans et al. [11] and Winkelbach and Wahl [12] proposed a shape from 2D
edge gradients reconstruction technique, which allows reconstructing the surface
normals (gradients) from two stripe patterns projected onto the object. The
reconstruction technique is based on the fact that directions of the projected
stripes in the captured 2D images depend on the local orientation of the surface
in 3D. Classical edge-detecting operators can be used to find the direction of the
stripe edges.

Figure 2 describes the relation between the surface gradient and the local
stripe direction. A pixel in the image plane defines the viewing vector $s$. The
stripe direction determines the stripe direction vector $v'$, lying in both the
image plane and in the viewing plane. The real tangential vector of the projected
stripe $v_1$ is perpendicular to the normal $c = v' \times s$ of the viewing plane
and to the normal $p$ of the stripe projection plane. Assuming parallel
projection, we obtain

$$v_1 = c \times p. \tag{8}$$

Acquiring a second image of the scene with a stripe illumination rotated relative
to the first one allows calculating a second tangential vector $v_2$. Next, the
surface normal is computed according to

$$n = v_1 \times v_2. \tag{9}$$

In [13], Winkelbach and Wahl propose to use a single lighting pattern to estimate
the surface normal from the local directions and widths of the projected stripes.
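
Equations (8) and (9) are two applications of the vector cross product. A
minimal sketch with made-up input vectors; the function names and the numeric
values are hypothetical, and the orientation (sign) of the resulting normal is
not fixed here.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def stripe_tangent(v_image, s, p):
    """Tangent of the projected stripe: c = v' x s, then v = c x p (Eq. (8))."""
    c = cross(v_image, s)  # normal of the viewing plane
    return cross(c, p)


s = (0.0, 0.0, 1.0)                      # viewing vector (parallel projection)
v1 = stripe_tangent((1.0, 0.0, 0.0), s, (0.3, 1.0, 0.2))   # first stripe pattern
v2 = stripe_tangent((0.0, 1.0, 0.0), s, (1.0, 0.1, -0.4))  # rotated pattern
normal = cross(v1, v2)                   # surface normal up to scale, Eq. (9)
```

By construction the normal is perpendicular to both tangent vectors, which is
the only property the surface gradient computation relies on.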

Fig. 2. 3D surface acquisition using structured light



3 Bending-Invariant Representation

The human face cannot be considered a rigid object since it undergoes
deformations resulting from facial expressions. On the other hand, the class of
transformations that a facial surface can undergo is not arbitrary, and a suitable
model for facial expressions is that of isometric (or length-preserving)
transformations [5]. Such transformations do not stretch or tear the surface, or
more rigorously, preserve the surface metric. In face recognition applications,
faces can be thought of as equivalence classes of surfaces obtained by isometric
transformations. Unfortunately, classical surface matching methods, based on
finding a Euclidean transformation of two surfaces which maximizes some shape
similarity criterion (see, for example, [15], [16], [17]), usually fail to find
similarities between two isometrically-deformed objects.

In [6], Elad and Kimmel introduced a deformable surface matching method, 
referred to as bending-invariant canonical forms, which was adopted in [5] for 
3D face recognition. The key idea of this method is computation of invariant 
representations of the deformable surfaces, and then application of a rigid surface 
matching algorithm on the obtained invariants. We give a brief description of 
the method, necessary for the elaboration in Section 4. 

Consider a polyhedral approximation of the facial surface, S. One can think
of such an approximation as obtained by sampling the underlying continuous
surface with a finite set of points $\{p_i\}_{i=1}^n$, and discretizing the metric
associated with the surface

$$\delta(p_i, p_j) = \delta_{ij}. \tag{10}$$

We define the matrix of squared mutual distances,

$$(\Delta)_{ij} = \delta_{ij}^2. \tag{11}$$

The matrix $\Delta$ is invariant under isometric surface deformations, but is not
a unique representation of isometric surfaces, since it depends on the arbitrary
ordering and selection of the surface points. We would like to obtain a geometric
invariant, which would be unique for isometric surfaces on one hand, and would
allow using simple rigid surface matching algorithms to compare such invariants
on the other. Treating the squared mutual distances as a particular case of
dissimilarities, one can apply a dimensionality-reduction technique called
multidimensional scaling (MDS) in order to embed the surface points with their
geodesic distances in a low-dimensional Euclidean space $\mathbb{R}^m$ [10], [14],
[6].

In [5] a particular MDS algorithm, the classical scaling, was used. The
embedding into $\mathbb{R}^m$ is performed by first double-centering the matrix
$\Delta$

$$B = -\frac{1}{2} J \Delta J \tag{12}$$

(here $J = I - \frac{1}{n} U$; I is an $n \times n$ identity matrix, and U is a
matrix of ones). Then, the first m eigenvectors $e^j$, corresponding to the m
largest eigenvalues of B, are used as the embedding coordinates

$$x_i^j = e_i^j; \quad i = 1, \dots, n, \quad j = 1, \dots, m, \tag{13}$$

where $x_i^j$ denotes the j-th coordinate of the vector $x_i$. The set of points
$x_i$ obtained by the MDS is referred to as the bending-invariant canonical form
of the surface; when m = 3, it can be plotted as a surface. Standard rigid surface
matching methods can be used in order to compare two deformable
surfaces, using their bending-invariant representations instead of the surfaces
themselves. Since the canonical form is computed up to a translation, rotation,
and reflection transformation, canonical forms must be aligned before they can be
compared. This can be done by setting the first-order moments
(center of mass) and the mixed second-order moments of the canonical form to
zero (see [18]).
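
The double-centering step of Eq. (12) can be sketched without any matrix
library, since the entries of J Delta J are the squared distances with row,
column and total means removed. The example below is an assumption-laden
illustration: it uses three points on a straight line with Euclidean distances
in place of geodesic ones, and omits the eigendecomposition. The function name
is hypothetical.

```python
def double_center(delta):
    """B = -1/2 * J * Delta * J with J = I - (1/n) U, written out per entry."""
    n = len(delta)
    row_mean = [sum(row) / n for row in delta]
    col_mean = [sum(delta[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [-0.5 * (delta[i][j] - row_mean[i] - col_mean[j] + total_mean)
         for j in range(n)]
        for i in range(n)
    ]


# Three points on a line at positions 0, 1 and 3.
positions = [0.0, 1.0, 3.0]
delta = [[(a - b) ** 2 for b in positions] for a in positions]
b_matrix = double_center(delta)

# For Euclidean data, B equals the Gram matrix of the centered coordinates.
mean = sum(positions) / len(positions)
centered = [p - mean for p in positions]
```

Because B is a Gram matrix in the Euclidean case, its leading eigenvectors
span the configuration of the points, which is what the embedding step uses.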

4 Measuring Geodesic Distances 

One of the crucial steps in the construction of the canonical form of a given
surface is an efficient algorithm for the computation of geodesic distances on
surfaces, that is, $\delta_{ij}$. A numerically consistent algorithm for distance
computation on triangulated domains, henceforth referred to as Fast Marching on
Triangulated Domains (FMTD), was used by Elad and Kimmel [6]. The FMTD was proposed
by Kimmel and Sethian [7] as a generalization of the fast marching method [8].
Using FMTD, the geodesic distances between a surface vertex and the rest of
the n surface vertices can be computed in O(n) operations. Measuring distances
on manifolds was later done for graphs of functions [19] and implicit manifolds
[20].

Since the main focus of this paper is how to avoid the surface reconstruction,
we present a modified version of FMTD, which computes the geodesic distances
on a surface, using the values of the surface gradient $\nabla z$ only. These
values can be obtained, for example, from photometric stereo or structured light.

The facial surface can be thought of as a parametric manifold, represented by
a mapping $X : \mathbb{R}^2 \to \mathbb{R}^3$ from the parameterization plane
$U = (u^1, u^2) = (x, y)$ to the manifold

$$X(U) = \bigl( x^1(u^1,u^2), \; x^2(u^1,u^2), \; x^3(u^1,u^2) \bigr); \tag{14}$$

which, in turn, can be written as

$$X(U) = (x, y, z(x,y)). \tag{15}$$

The derivatives of X with respect to $u^i$ are defined as
$X_i = \frac{\partial}{\partial u^i} X$, and they constitute a non-orthogonal
coordinate system on the manifold (Figure 3). In the particular case of Eq. (15),

$$X_1(U) = (1, 0, z_x(x,y)), \quad X_2(U) = (0, 1, z_y(x,y)). \tag{16}$$

The distance element on the manifold is

$$ds = \sqrt{g_{ij} \, du^i \, du^j}, \tag{17}$$

where we use Einstein's summation convention, and the metric tensor $g_{ij}$ of the
manifold is given by

$$(g_{ij}) = \begin{pmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{pmatrix} = \begin{pmatrix} X_1 \cdot X_1 & X_1 \cdot X_2 \\ X_2 \cdot X_1 & X_2 \cdot X_2 \end{pmatrix}. \tag{18}$$
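
Equations (16) to (18) give the metric tensor directly from the surface
gradient: g_11 = 1 + z_x^2, g_12 = g_21 = z_x z_y, g_22 = 1 + z_y^2. A minimal
sketch with hypothetical function names, checked on a plane where lengths are
known exactly:

```python
import math


def metric_tensor(z_x, z_y):
    """Return (g11, g12, g22) for X_1 = (1, 0, z_x) and X_2 = (0, 1, z_y)."""
    g11 = 1.0 + z_x * z_x
    g12 = z_x * z_y
    g22 = 1.0 + z_y * z_y
    return g11, g12, g22


def arc_length(g, du1, du2):
    """Length ds of a small parameter-plane step (du1, du2) under the metric g."""
    g11, g12, g22 = g
    return math.sqrt(g11 * du1 * du1 + 2.0 * g12 * du1 * du2 + g22 * du2 * du2)


# Plane z = 3x + 4y: a step (1, 1) in the parameter plane moves by (1, 1, 7) in 3D.
g = metric_tensor(3.0, 4.0)
step_length = arc_length(g, 1.0, 1.0)
```

No integration of the gradient field is involved, which is the reason the
surface itself never has to be reconstructed.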



The classical Fast Marching method [8] calculates distances in an orthogonal
coordinate system. The numerical stencil for the update of a grid point consists
of the vertices of a right triangle. In our case, $g_{12} \neq 0$ and the resulting
triangles are not necessarily right ones. If a grid point is updated by a stencil
which is an obtuse triangle, a problem may arise: the values of one of the points
of the stencil might not be set in time and cannot be used. There is a similar
obstacle in Fast Marching on triangulated domains which include obtuse triangles [7].




Fig. 3. The orthogonal grid on the parameterization plane U is transformed into a
non-orthogonal one on the manifold X(U)



Our solution is similar to that of [7]. We perform a preprocessing stage for
the grid, in which we split every obtuse triangle into two acute ones (see Figure
4). The split is performed by adding an additional edge, connecting the updated
grid point with a non-neighboring grid point. The distant grid point becomes
part of the numerical stencil. The need for splitting is determined according to
the angle between the non-orthogonal axes at the grid point. It is calculated by

$$\cos\alpha = \frac{X_1 \cdot X_2}{\|X_1\| \, \|X_2\|} = \frac{g_{12}}{\sqrt{g_{11} g_{22}}}. \tag{19}$$

If $\cos\alpha = 0$, the axes are perpendicular, and no splitting is required. If
$\cos\alpha < 0$, the angle $\alpha$ is obtuse and should be split. The denominator
on the right-hand side of Eq. (19) is always positive, so we need only check the
sign of the numerator $g_{12}$. In order to split an angle, we should connect the
updated grid point with another point, located m grid points from the point in the
$X_1$ direction, and n grid points in the $X_2$ direction (m and n can be
negative). The point is a proper supporting point if the obtuse angle is split
into two acute ones. For $\cos\alpha < 0$ this is the case if

$$\cos\beta_1 = \frac{X_1 \cdot (m X_1 + n X_2)}{\|X_1\| \, \|m X_1 + n X_2\|} = \frac{m g_{11} + n g_{12}}{\sqrt{g_{11} (m^2 g_{11} + 2mn g_{12} + n^2 g_{22})}} > 0, \tag{20}$$

and

$$\cos\beta_2 = \frac{X_2 \cdot (m X_1 + n X_2)}{\|X_2\| \, \|m X_1 + n X_2\|} = \frac{m g_{12} + n g_{22}}{\sqrt{g_{22} (m^2 g_{11} + 2mn g_{12} + n^2 g_{22})}} > 0. \tag{21}$$

Also here, it is enough to check the sign of the numerators. For $\cos\alpha > 0$,
$\cos\beta_2$ changes its sign and the constraints are

$$m g_{11} + n g_{12} > 0 \quad \text{and} \quad m g_{12} + n g_{22} < 0. \tag{22}$$

This process is done for all grid points. Once the preprocessing stage is done, we 
have a suitable numerical stencil for each grid point and we can calculate the 
distances. 
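
The sign tests of Eqs. (19) to (21) can be sketched as follows for the case
cos(alpha) < 0. The brute-force search over non-negative grid offsets and the
choice of the shortest candidate are assumptions made for illustration; the
paper only states that supporting points can be found efficiently and refers
to [22] for details. All function names are hypothetical.

```python
def needs_split(g12):
    """The angle between X_1 and X_2 is obtuse exactly when g12 < 0."""
    return g12 < 0.0


def is_supporting_point(g, m, n):
    """Numerator tests of Eqs. (20) and (21): both split angles must be acute."""
    g11, g12, g22 = g
    return (m * g11 + n * g12 > 0.0) and (m * g12 + n * g22 > 0.0)


def find_supporting_point(g, max_offset=5):
    """Search offsets 0..max_offset for the shortest edge m X_1 + n X_2 that works."""
    g11, g12, g22 = g
    best, best_len2 = None, float("inf")
    for m in range(max_offset + 1):
        for n in range(max_offset + 1):
            if is_supporting_point(g, m, n):
                # Squared length of m X_1 + n X_2 measured with the metric.
                len2 = m * m * g11 + 2 * m * n * g12 + n * n * g22
                if len2 < best_len2:
                    best, best_len2 = (m, n), len2
    return best


# Gradient (z_x, z_y) = (1, -1): g = (2, -1, 2), angle of 120 degrees.
support_a = find_supporting_point((2.0, -1.0, 2.0))
# Gradient (z_x, z_y) = (2, -1): g = (5, -2, 2), the diagonal neighbour is not enough.
support_b = find_supporting_point((5.0, -2.0, 2.0))
```

In the second example the offset (1, 1) fails test (21) because the numerator
is exactly zero, so the next admissible grid point is used instead.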

The numerical scheme used is similar to that of [7], with the exception that
there is no need to perform the unfolding step. The supporting grid points that
split the obtuse angles can be found more efficiently. The required triangle edge
lengths and angles are calculated according to the surface metric $g_{ij}$ at the
grid point, which, in turn, is computed using the surface gradients $z_x$, $z_y$.
A more detailed description appears in [22].



5 3D Face Recognition Using Photometric Stereo 
without Surface Reconstruction 

The modified FMTD method allows us to bypass the surface reconstruction
stage in the 3D face recognition algorithm introduced in [5]. Instead, the values
of the facial surface gradient $\nabla z$ are computed on a uniform grid using one
of the methods discussed in Section 2 (see Figure 5). At the second stage, the raw
data are preprocessed as proposed in [5]; in that paper, the preprocessing stage
was limited to detecting the facial contour and cropping the parts of the face
outside the contour.

Fig. 4. The numerical support for the non-orthogonal coordinate system. Triangle 1
gives a proper numerical support, yet triangle 2 is obtuse. It is replaced by triangle 3
and triangle 4





Fig. 5. Surface gradient field (left), reconstructed surface (center) and its bending- 
invariant canonical form represented as a surface (right) 



Next, an $n \times n$ matrix of squared geodesic distances is created by applying
the modified FMTD from each of the $n$ selected vertices of the grid. Then, MDS
is applied to the distance matrix, producing a canonical form of the face in a
low-dimensional Euclidean space (three-dimensional in all our experiments). The
obtained canonical forms are compared using a rigid surface matching algorithm.
Texture is not treated in this paper.
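
The embedding step can be illustrated with classical (Torgerson) scaling. The text does not state which MDS variant was used, so the choice of classical scaling and the function name `classical_mds` are assumptions made only for this sketch.

```python
import numpy as np


def classical_mds(sq_dists, dim=3):
    """Embed n points in `dim` dimensions from an n x n matrix of SQUARED
    distances using classical scaling (double centering + eigenvectors)."""
    d2 = np.asarray(sq_dists, dtype=float)
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering      # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(gram)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:dim]           # indices of the largest ones
    scale = np.sqrt(np.clip(evals[top], 0.0, None))
    return evecs[:, top] * scale                  # n x dim coordinates
```

When the input distances are exactly Euclidean, the embedding reproduces them up to a rigid motion; for geodesic distances on a curved surface it gives only an approximation.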

As in [5], the method of moments described in [18] was used for rigid surface
matching. The $(p, q, r)$-th moment of a three-dimensional surface is given by

$$M_{pqr} = \sum_n (x_n^1)^p (x_n^2)^q (x_n^3)^r, \qquad (23)$$

where $x_n^i$ denotes the $i$-th coordinate of the $n$-th point in the surface
samples. In order to compare two surfaces, the vector of the first $M$ moments
$(M_{p_1 q_1 r_1}, \ldots, M_{p_M q_M r_M})$, termed the moment signature, is
computed for each surface. The Euclidean distance between two moment signatures
measures the dissimilarity between the two surfaces.

Fig. 6. A face from Yale Database B, acquired with different illuminations:
(0°,0°), (0°,-20°), (0°,+20°), (-25°,0°), (+25°,0°). Numbers in brackets indicate
the azimuth and the elevation angle, respectively, determining the illumination
direction
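
A minimal sketch of the moment signature comparison follows. The list of moment orders is supplied by the caller; which orders the authors actually used is not stated here, so the `orders` argument and the function names are illustrative assumptions.

```python
import numpy as np


def moment_signature(points, orders):
    """Vector of moments M_pqr = sum_n x_n^p * y_n^q * z_n^r, Eq. (23),
    for an (N, 3) array of surface samples and a list of (p, q, r) orders."""
    pts = np.asarray(points, dtype=float)
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    return np.array([np.sum(x ** p * y ** q * z ** r) for p, q, r in orders])


def dissimilarity(points_a, points_b, orders):
    """Euclidean distance between the moment signatures of two surfaces."""
    sig_a = moment_signature(points_a, orders)
    sig_b = moment_signature(points_b, orders)
    return float(np.linalg.norm(sig_a - sig_b))
```

Both surfaces are assumed to be aligned before the signatures are computed, since the moments are not invariant to rigid motion.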



5.1 Experimental Results 

In order to exemplify our approach, we performed an experiment which
demonstrates that comparison of canonical forms obtained without actual facial
surface reconstruction is, in some cases, better than reconstruction and direct
(rigid) comparison of the surfaces. It must be stressed that the purpose of the
example is not to validate the 3D face recognition accuracy (which has been
previously performed in [5]), but rather to test the feasibility of the proposed
modified FMTD algorithm together with photometric stereo.

The Yale Face Database B [21] was used for the experiment. The database
consisted of high-resolution grayscale images of different instances of 10 subjects
of both Caucasian and Asian type, taken in controlled illumination conditions
(Figure 6). Some instances of 7 subjects were taken from the database for the
experiment. Direct surface matching consisted of the retrieval of the surface
gradient according to Eq. (6) using 5 different illumination directions,
reconstruction of the surface according to Eq. (7), alignment, and computation of
the surface moment signature according to Eq. (23). Canonical forms were computed
from the surface gradient, aligned, and converted into a moment signature according
to Eq. (23).

In order to get some notion of the algorithms' accuracy, we converted the
relative distances between the subjects produced by each algorithm into 3D
proximity patterns (Figure 7). These patterns, representing each subject as a
point in $\mathbb{R}^3$, were obtained by applying MDS to the relative distances
(with a distortion of less than 1%). The entire cloud of dots was partitioned into
clusters formed by instances of the subjects $C_1$ to $C_7$. Visually, the more
compact the clusters $C_i$ are and the farther they lie from other clusters, the
more accurate the algorithm is. Quantitatively, we measured (i) the variance
$\sigma_i$ of $C_i$ and (ii) the distance $d_i$ between the centroid of $C_i$ and
the centroid of the nearest cluster. Table 1 shows a quantitative comparison of
the algorithms. Inter-cluster distances $d_i$ are given in units of the variance
$\sigma_i$. Clusters $C_5$ to $C_7$, each consisting of a single instance of the
subject, are not presented in the table. The use of canonical forms improved the
cluster variance and the inter-cluster distance by about one order of magnitude,
compared to direct facial surface matching.
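
The two cluster measures can be sketched as follows. The exact definitions are not spelled out in the text, so this example assumes that the variance is the mean squared distance of the cluster members to their centroid and that $d_i$ is the centroid-to-nearest-centroid distance divided by that variance; both choices, and the function name, are hypothetical.

```python
import numpy as np


def cluster_stats(points, labels):
    """Return {label: (sigma, d)} where sigma is the mean squared distance
    to the cluster centroid (assumed definition of the variance) and d is
    the distance to the nearest other centroid in units of sigma."""
    pts = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    ids = sorted(set(labels.tolist()))
    centroids = {c: pts[labels == c].mean(axis=0) for c in ids}
    stats = {}
    for c in ids:
        members = pts[labels == c]
        sigma = float(np.mean(np.sum((members - centroids[c]) ** 2, axis=1)))
        nearest = min(float(np.linalg.norm(centroids[c] - centroids[o]))
                      for o in ids if o != c)
        # Single-instance clusters have zero variance; report d as infinite.
        stats[c] = (sigma, nearest / sigma if sigma > 0 else float("inf"))
    return stats
```

Under these assumptions, a compact and well-separated cluster shows a small `sigma` and a large `d`, which is the pattern reported for the canonical forms.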



Table 1. Properties of face clusters in Yale Database B using direct surface matching
(dir) and canonical forms (can). $\sigma$ is the variance of the cluster and $d$ is
the distance to the nearest cluster.

Cluster   $\sigma_{dir}$   $d_{dir}$   $\sigma_{can}$   $d_{can}$
$C_1$     0.1749           0.1704      0.0140           4.3714
$C_2$     0.2828           0.3745      0.0120           5.1000
$C_3$     0.0695           0.8676      0.0269           2.3569
$C_4$     0.0764           0.7814      0.0139           4.5611




Fig. 7. Visualization of the face recognition results as three-dimensional proximity 
patterns. Subjects from the face database represented as points obtained by applying 
MDS to the relative distances between subjects. Shown here: straightforward surface 
matching (A) and canonical forms (B) 



6 Conclusions 

We have shown how to perform face recognition according to [5] without
reconstructing the 3D facial surface. We used a modification of the Kimmel-Sethian
FMTD algorithm to compute geodesic distances between points on the
facial surface using only the surface metric tensor at each point. Our approach
allows the use of simple and efficient 3D acquisition techniques like photometric
stereo for fast and accurate face recognition. Experimental results demonstrate
the feasibility of our approach for the task of face recognition.



Abstract. Mean shift is a nonparametric estimator of density which has
been applied to image and video segmentation. Traditional mean shift 
based segmentation uses a radially symmetric kernel to estimate local 
density, which is not optimal in view of the often structured nature of 
image and more particularly video data. In this paper we present an 
anisotropic kernel mean shift in which the shape, scale, and orientation 
of the kernels adapt to the local structure of the image or video. We 
decompose the anisotropic kernel to provide handles for modifying the 
segmentation based on simple heuristics. Experimental results show that 
the anisotropic kernel mean shift outperforms the original mean shift 
on image and video segmentation in the following aspects: 1) it gets 
better results on general images and video in a smoothness sense; 2) the 
segmented results are more consistent with human visual saliency; 3) the 
algorithm is robust to initial parameters. 



1 Introduction 

Image segmentation refers to identifying homogeneous regions in the image.
Video segmentation, in this paper, means the joint spatial and temporal analysis
of video sequences to extract regions in dynamic scenes. Both of these tasks
are deceptively difficult and have been extensively studied for several decades.
Refer to [9,10,11] for some good surveys. Generally, spatio-temporal video
segmentation can be viewed as an extension of image segmentation from a 2D to a
3D lattice. Recently, mean shift based image and video segmentation has gained
considerable attention due to its promising performance.

Many other data clustering methods have been described in the literature, 
ranging from top down methods such as K-D trees, to bottom up methods such 
as K-means and more general statistical methods such as mixtures of Gaussians. 
In general these methods have not performed satisfactorily for image data due 
to their reliance on an a priori parametric structure of the data segment, and/or 
estimates of the number of segments expected. Mean shift’s appeal is derived 
from both its performance and its relative freedom from specifying an expected 
number of segments. As we will see, this freedom has come at the cost of having 
to specify the size (bandwidth) and shape of the influence kernel for each pixel 
in advance. 

The difficulty in selecting the kernel was recognized in [3,4,12] and was
addressed by automatically determining a bandwidth for spherical kernels. These
approaches are all purely data driven. We will leverage this work and extend it
to automatically select general elliptical (anisotropic) kernels for each pixel. We
also add a priori knowledge about typical structures found in video data to take
advantage of the extra freedom in the kernels to adapt to the local structure.



1.1 Mean Shift Based Image and Video Segmentation 

Rather than begin from an initial guess at the segmentation, such as seeding 
points in the K-means algorithm, mean shift begins at each data point (or pixel 
in an image or video) and first estimates the local density of similar pixels (i.e., 
the density of nearby pixels with similar color). As we will see, carefully defining 
“nearby” and “similar” can have an important impact on the results. This is the 
role the kernel plays. 

More specifically, mean shift algorithms estimate the local density gradient 
of similar pixels. These gradient estimates are used within an iterative procedure 
to find the peaks in the local density. All pixels that are drawn upwards to the 
same peak are then considered to be members of the same segment. 

As a general nonparametric density estimator, mean shift is an old pattern
recognition procedure proposed by Fukunaga and Hostetler [7], and its efficacy
on low-level vision tasks such as segmentation and tracking has been extensively
exploited recently. In [1,5], it was applied to continuity-preserving filtering and
image segmentation. Its properties were reviewed and its convergence on lattices
was proven. In [2], it was used for non-rigid object tracking and a sufficient
convergence condition was given. Applying mean shift on a 3D lattice to get a
spatio-temporal segmentation of video was achieved in [6], in which a hierarchical
strategy was employed to cluster pixels of the 3D space-time video stack, which
were mapped to 7D feature points (position(2), time(1), color(3), and motion(1)).

The application of mean shift to an image or video consists of two stages. The
first stage is to define a kernel of influence for each pixel $x_i$. This kernel
defines a measure of intuitive distance between pixels, where distance encompasses
both spatial (and temporal in the case of video) as well as color distance. Although
manual selection of the size (or bandwidth) and shape of the kernel can produce
satisfactory results on general image segmentation, it has a significant limitation.
When local characteristics of the data differ significantly across the domain, it is
difficult to select globally optimal bandwidths. As a result, in a segmented image
some objects may appear too coarse while others are too fine. Some efforts have
been reported to locally vary the bandwidth. Singh and Ahuja [12] determine
local bandwidths using Parzen windows to mimic local density. Another variable
bandwidth procedure was proposed in [3] in which the bandwidth was enlarged
in sparse regions to overcome the noise inherent with limited data.

Although the size may vary locally, all the approaches described above used a 
radially symmetric kernel. One exception is the recent work in [4] that describes 
the possibility of using the general local covariance to define an asymmetric 
kernel. However, this work goes on to state, “Although a fully parameterized 
covariance matrix can be computed.., this is not necessarily advantageous..” 
and then returns to the use of radially symmetric kernels for reported results.

The second iterative stage of the mean shift procedure assigns to each pixel a
mean shift point, initialized to coincide with the pixel. These mean shift
points are then iteratively moved upwards along the gradient of the density
function defined by the sum of all the kernels until they reach a stationary point
(a mode or hilltop on the virtual terrain defined by the kernels). The pixels
associated with the set of mean shift points that migrate to the (approximately)
same stationary point are considered to be members of a single segment. Neighboring
segments may then be combined in a post process.
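
The iterative stage can be illustrated with the simplest fixed-bandwidth variant: each mean shift point is repeatedly replaced by the average of the samples inside a ball of radius `h` around it, which is the mean shift update for an Epanechnikov density kernel. This is a generic sketch rather than the procedure of this paper; the function name, tolerance, and iteration cap are assumptions.

```python
import numpy as np


def mean_shift_point(start, samples, h, tol=1e-6, max_iter=200):
    """Move one mean shift point uphill until it stops changing.

    Each step replaces the point by the mean of all samples within
    distance h (uniform weights inside the ball, zero outside)."""
    samples = np.asarray(samples, dtype=float)
    y = np.array(start, dtype=float)
    for _ in range(max_iter):
        inside = np.sum((samples - y) ** 2, axis=1) < h * h
        if not inside.any():          # no support: nothing pulls the point
            break
        y_new = samples[inside].mean(axis=0)
        if np.linalg.norm(y_new - y) < tol:
            return y_new              # reached a stationary point (mode)
        y = y_new
    return y
```

Pixels whose mean shift points end at (approximately) the same location would then be grouped into one segment.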

Mathematically, the general multivariate kernel density estimate at the point
$x$ is defined by

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i) \qquad (1)$$

where the $n$ data points $x_i$ represent a sample from some unknown density $f$,
or in the case of images or video, the pixels themselves, and

$$K_H(x) = |H|^{-1/2}\, K(H^{-1/2} x) \qquad (2)$$

where $K(z)$ is the $d$-variate kernel function with compact support satisfying the
regularity constraints as described in [13], and $H$ is a symmetric positive definite
$d \times d$ bandwidth matrix. For the radially symmetric kernel, we have

$$K(z) = c\, k(\|z\|^2) \qquad (3)$$

where $c$ is the normalization constant. If one assumes a single global spherical
bandwidth, $H = h^2 I$, the kernel density estimator becomes

$$\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \qquad (4)$$
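
To make the relation between the general and the spherical estimators concrete, the sketch below evaluates both and can be used to check that they agree when $H = h^2 I$. It assumes an Epanechnikov-type profile $k(u) = 1 - u$ on $[0, 1)$ and sets the normalization constant $c$ to 1; these choices and the function names are illustrative only.

```python
import numpy as np


def profile(u):
    """Assumed profile k(u) = 1 - u for 0 <= u < 1 and 0 otherwise (c = 1)."""
    u = np.asarray(u, dtype=float)
    return np.where(u < 1.0, 1.0 - u, 0.0)


def kde_general(x, samples, H):
    """Eqs. (1)-(3): density estimate with a full SPD bandwidth matrix H."""
    evals, evecs = np.linalg.eigh(H)
    h_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # H^{-1/2}
    z = (x - samples) @ h_inv_sqrt          # rows are H^{-1/2} (x - x_i)
    k = profile(np.sum(z ** 2, axis=1))
    return np.linalg.det(H) ** -0.5 * k.mean()


def kde_spherical(x, samples, h):
    """Eq. (4): the special case H = h^2 I."""
    n, d = samples.shape
    k = profile(np.sum(((x - samples) / h) ** 2, axis=1))
    return k.sum() / (n * h ** d)
```

With $H = h^2 I$ we have $|H|^{-1/2} = h^{-d}$ and $H^{-1/2} x = x / h$, so the two functions return the same value.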



For image and video segmentation, the feature space is composed of two
independent domains: the spatial/lattice domain and the range/color domain. We
map a pixel to a multi-dimensional feature point which includes the $p$ dimensional
spatial lattice ($p = 2$ for image and $p = 3$ for video) and $q$ dimensional
color ($q = 3$ for L*u*v color space). Due to the different natures of the domains,
the kernel is usually broken into the product of two different radially symmetric
kernels (superscript $s$ will refer to the spatial domain, and $r$ to the color range):

$$K_{h^s,h^r}(x) = \frac{c}{(h^s)^p (h^r)^q}\, k^s\!\left(\left\|\frac{x^s}{h^s}\right\|^2\right) k^r\!\left(\left\|\frac{x^r}{h^r}\right\|^2\right) \qquad (5)$$



where $x^s$ and $x^r$ are respectively the spatial and range parts of a feature
vector, $k^s$ and $k^r$ are the profiles used in the two domains, $h^s$ and $h^r$
are the bandwidths employed in the two domains, and $c$ is the normalization
constant. With the kernel from (5), the kernel density estimator is

$$\hat{f}(x) = \frac{c}{n (h^s)^p (h^r)^q} \sum_{i=1}^{n} k^s\!\left(\left\|\frac{x^s - x_i^s}{h^s}\right\|^2\right) k^r\!\left(\left\|\frac{x^r - x_i^r}{h^r}\right\|^2\right) \qquad (6)$$




Image and Video Segmentation by Anisotropic Kernel Mean Shift 241 



As apparent in Equations (5) and (6), there are two main parameters that
have to be defined by the user for the simple radially symmetric kernel based
approach: the spatial bandwidth $h^s$ and the range bandwidth $h^r$. In the variable
bandwidth mean shift procedure proposed in [3], the estimator (6) is changed to

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{c}{(h_i^s)^p (h_i^r)^q}\, k^s\!\left(\left\|\frac{x^s - x_i^s}{h_i^s}\right\|^2\right) k^r\!\left(\left\|\frac{x^r - x_i^r}{h_i^r}\right\|^2\right) \qquad (7)$$



There are two important differences between (6) and (7). First, potentially
different bandwidths $h_i^s$ and $h_i^r$ are assigned to each pixel $x_i$, as
indicated by the subscript $i$. Second, the different bandwidths associated with
each point appear within the summation. This is the so-called sample point
estimator [3], as opposed to the balloon estimator defined in Equation (6). The
sample point estimator, which we will refer to as we proceed, ensures that all
pixels respond to the same global density estimation during the segmentation
procedure. Note that the sample point and balloon estimators are the same in the
case of a single globally applied bandwidth.
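
The difference between the fixed-bandwidth estimator (6) and the sample point estimator (7) can be sketched as below. The example assumes the same Epanechnikov-type profile in both domains and $c = 1$, which the text does not prescribe at this point; all names are hypothetical.

```python
import numpy as np


def profile(u):
    """Assumed profile k(u) = 1 - u for 0 <= u < 1 and 0 otherwise."""
    u = np.asarray(u, dtype=float)
    return np.where(u < 1.0, 1.0 - u, 0.0)


def fixed_bandwidth_density(x_s, x_r, S, R, hs, hr):
    """Eq. (6) with c = 1: one global pair of bandwidths (hs, hr)."""
    n, p = S.shape
    q = R.shape[1]
    ks = profile(np.sum(((x_s - S) / hs) ** 2, axis=1))
    kr = profile(np.sum(((x_r - R) / hr) ** 2, axis=1))
    return np.sum(ks * kr) / (n * hs ** p * hr ** q)


def sample_point_density(x_s, x_r, S, R, hs_i, hr_i):
    """Eq. (7) with c = 1: bandwidths hs_i[i], hr_i[i] belong to sample i
    and therefore stay inside the summation."""
    p, q = S.shape[1], R.shape[1]
    ks = profile(np.sum(((x_s - S) / hs_i[:, None]) ** 2, axis=1))
    kr = profile(np.sum(((x_r - R) / hr_i[:, None]) ** 2, axis=1))
    return np.mean(ks * kr / (hs_i ** p * hr_i ** q))
```

Passing constant arrays for `hs_i` and `hr_i` reproduces the fixed-bandwidth value, which mirrors the remark that the two estimators coincide for a single global bandwidth.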



1.2 Motivation for an Anisotropic Kernel 

During the iterative stage of the mean shift procedure, the mean shift points 
associated with each pixel climb to the hilltops of the density function. At each 
iteration, each mean shift point is attracted in varying amounts by the sample 
point kernels centered at nearby pixels. More intuitively, a kernel represents a 
measure of the likelihood that other points are part of the same segment as 
the point under the kernel’s center. With no a priori knowledge of the image 
or video, actual distance (in space, time, and color) seems an obvious (inverse) 
correlate for this likelihood; the closer two pixels are to one another the more 
likely they are to be in the same segment. 

We can, however, take advantage of examining a local region surrounding 
each pixel to select the size and shape of the kernel. Unlike [3], we leverage the 
full local covariance matrix of the local data to create a kernel with a general 
elliptical shape. Such kernels adapt better to non-compact (i.e., long skinny) local 
features such as can be seen in the monkey bars detail in Figure 2 and the zebra 
stripes in Figure 5. Such features are even more prevalent in video data from 
stationary or from slowly or linearly moving cameras. When considering video 
data, a spatio-temporal slice (parallel to the temporal axis) is as representative 
of the underlying data as any single frame (orthogonal to the temporal axis). 
Such a slice of video data exhibits stripes with a slope relative to the speed at 
which objects move across the visual field (see Figures 3 and 4). The problems
in the use of radially symmetric kernels are particularly apparent in these
spatio-temporal slice segmentations. The irregular boundaries between and across the
stripe-like features cause a lack of temporal coherence in the video segmentation.

An anisotropic kernel can adapt its profile to the local structure of the data.
The use of such kernels proves more robust, and is less sensitive to initial
parameters compared with symmetric kernels. Furthermore, the anisotropic kernel
provides a set of handles for application-driven segmentation. For instance, a
user may desire that the still background regions be more coarsely segmented
while the details of the moving objects are preserved when segmenting a video
sequence. To achieve this, we simply expand those local kernels (in the
color and/or spatial dimensions) whose profiles have been elongated along the
time dimension. By providing a set of heuristic rules described below on how
to modulate the kernels, the segmentation strategy can be adapted to various
applications.

2 Anisotropic Kernel Mean Shift 

2.1 Definition 

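The Anisotropic Kernel Mean Shift associates with each data point (a pixel in an
image or video) an anisotropic kernel. The kernel associated with a pixel adapts
to the local structure by adjusting its shape, scale, and orientation. Formally,
the density estimator is written as

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^r(H_i^s)^q}\, k^s\!\left(g(x^s, x_i^s, H_i^s)\right) k^r\!\left(\left\|\frac{x^r - x_i^r}{h^r(H_i^s)}\right\|^2\right) \qquad (8)$$

where $g(x^s, x_i^s, H_i^s)$ is the Mahalanobis metric in the spatial domain:

$$g(x^s, x_i^s, H_i^s) = (x_i^s - x^s)^T (H_i^s)^{-1} (x_i^s - x^s) \qquad (9)$$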

In this paper we use a spatial kernel with a constant profile, $k^s(z) = 1$ if
$|z| < 1$, and 0 otherwise. For the color domain we use an Epanechnikov kernel
with a profile $k^r(z) = 1 - |z|$ if $|z| < 1$ and 0 otherwise. Note that in our
definition, the bandwidth in color range $h^r$ is a function of the bandwidth matrix
in space domain $H_i^s$. Since $H_i^s$ is determined by the local structure of the
video, $h^r$ thus varies from one pixel to another. Possibilities on how to
modulate $h^r$ according to $H_i^s$ will be discussed later.
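
A minimal sketch of evaluating such an anisotropic density is given below. It uses the constant spatial profile and the Epanechnikov color profile stated above and the Mahalanobis metric of Eq. (9); the per-sample weighting by $1/(h^r)^q$ and the way the bandwidths are passed in (one matrix and one color radius per sample) are assumptions of this example, not a statement of the authors' implementation.

```python
import numpy as np


def mahalanobis_sq(x_s, xi_s, H):
    """Eq. (9): (x_i - x)^T H^{-1} (x_i - x) for an SPD matrix H."""
    diff = np.asarray(xi_s, dtype=float) - np.asarray(x_s, dtype=float)
    return float(diff @ np.linalg.solve(H, diff))


def flat_profile(z):
    """Spatial profile: 1 inside the unit ball, 0 outside."""
    return 1.0 if abs(z) < 1.0 else 0.0


def epanechnikov_profile(z):
    """Color profile: 1 - |z| inside the unit ball, 0 outside."""
    return 1.0 - abs(z) if abs(z) < 1.0 else 0.0


def anisotropic_density(x_s, x_r, S, R, H_list, hr_list):
    """Sum of per-sample anisotropic kernels evaluated at (x_s, x_r).
    H_list[i] is the spatial bandwidth matrix of sample i and
    hr_list[i] its color bandwidth (assumed to be given)."""
    q = np.asarray(R).shape[1]
    total = 0.0
    for xi_s, xi_r, H, hr in zip(S, R, H_list, hr_list):
        ks = flat_profile(mahalanobis_sq(x_s, xi_s, H))
        kr = epanechnikov_profile(np.sum(((x_r - xi_r) / hr) ** 2))
        total += ks * kr / hr ** q
    return total / len(S)
```

For a kernel with `H = diag(4, 1)`, a point 1.5 units away along the first axis lies inside the spatial support, while a point 1.5 units away along the second axis lies outside it.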

The bandwidth matrix $H_i^s$ is symmetric positive definite. If it is simplified
into a diagonal matrix with equal diagonal elements (i.e., a scaled identity), then
$H_i^s$ models the radially symmetric kernels. In the case of video data, the time
dimension may be scaled differently to represent notions of equivalent "distance"
in time vs. image space. In general, allowing the diagonal terms to be scaled
differently allows for the kernels to take on axis-aligned ellipsoidal shapes. A full
$H_i^s$ matrix provides the freedom to model kernels of a general ellipsoidal shape
oriented in any direction. The eigenvectors of $H_i^s$ will point along the axes of
such ellipsoids. We use this additional freedom to shape the kernels to reflect
local structures in the video as described in the next section.

2.2 Kernel Modulation Strategies 

Anisotropic kernel mean shift gives us a set of handles for modulating the kernels
during the mean shift procedure. How to modulate the kernel is application
related and there is no uniform theory for guidance. We provide some intuitive
heuristics for video data with an eye towards visually salient segmentation. In
the case of video data we want to give long skinny segments at least an equal
chance to form as more compact shapes. These features often define the salient
features in an image. In addition, they are often very prominent features in the
spatio-temporal slices as can be seen in many spatio-temporal diagrams. In
particular, we want to recognize segments with special properties in the time domain.
For example, we may wish to allow static objects to form into larger segments
while moving objects are represented more finely with smaller segments.

An anisotropic bandwidth matrix $H_i^s$ is first estimated starting from a
standard radially symmetric diagonal $H_i^s$ and color radius $h^r$. The
neighborhood of pixels around $x_i$ is defined by those pixels whose position, $x$,
satisfies

$$k^s\!\left(g(x, x_i, H_i^s)\right) < 1; \quad k^r\!\left(\left\|\frac{x^r - x_i^r}{h^r}\right\|^2\right) < 1 \qquad (10)$$

An analysis of variance of the points within the neighborhood of $x_i$ provides a
new full matrix $H_i^s$ that better describes the local neighborhood of points.

To understand how to modulate the full bandwidth matrix $H_i^s$, it is useful
to decompose it as

$$H_i^s = \lambda D A D^T \qquad (11)$$

where $\lambda$ is a global scalar, $D$ is a matrix of normalized eigenvectors, and
$A$ is a diagonal matrix of eigenvalues which is normalized to satisfy:

$$\prod_{i=1}^{p} a_i = 1 \qquad (12)$$

where $a_i$ is the $i$-th diagonal element of $A$, and $a_i \geq a_j$ for $i < j$.
Thus, $\lambda$ defines the overall volume of the new kernel, $A$ defines the
relative lengths of the axes, and $D$ is a rotation matrix that orients the kernel
in space and time.
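
The decomposition can be sketched with a symmetric eigendecomposition. The example assumes that the "analysis of variance" amounts to the sample covariance of the neighborhood positions and that the global scalar is the geometric mean of the eigenvalues, which is what makes the normalized eigenvalues multiply to one; the function names are hypothetical.

```python
import numpy as np


def local_covariance(neighbors):
    """Assumed estimate of the full bandwidth matrix: sample covariance of
    the (m, p) array of neighbor positions."""
    return np.cov(np.asarray(neighbors, dtype=float), rowvar=False)


def decompose_bandwidth(H):
    """Split an SPD matrix H into (lam, D, A) with H = lam * D @ A @ D.T,
    prod(diag(A)) = 1 and diagonal entries of A in descending order."""
    evals, evecs = np.linalg.eigh(H)
    order = np.argsort(evals)[::-1]               # largest eigenvalue first
    evals, D = evals[order], evecs[:, order]
    p = H.shape[0]
    lam = float(np.prod(evals) ** (1.0 / p))      # geometric mean of eigenvalues
    A = np.diag(evals / lam)
    return lam, D, A


def compose_bandwidth(lam, D, A):
    """Rebuild the bandwidth matrix after lam and A have been modulated."""
    return lam * D @ A @ D.T
```

Note that `numpy.linalg.eigh` returns an orthonormal `D` whose determinant may be -1; flipping the sign of one column turns it into a proper rotation without changing the reconstructed matrix.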

We now have intuitive handles for modulating the anisotropic kernel. The
$D$ matrix calculated by the covariance analysis is kept unchanged during the
modulation process to maintain the orientation of the local data. By adjusting
$\lambda$ and $A$, we can control the spatial size and shape of the kernel. For
example, we can encourage the segmentation to find long skinny regions by
diminishing the smaller eigenvalues in $A$ as

(13)



In this way the spatial kernel will stretch more in the direction in which the
object elongates. To create larger segments for static objects we detect kernels
oriented along the time axis as follows. First, a scale factor $s_t$ is computed as



p - 1 

St = a + (1 - a) JJ di(t) 2 

i = 1 



(14) 




